Educational Guide

LLM Evaluation Metrics: How to Measure AI Quality

A comprehensive guide to LLM evaluation metrics: learn how to measure accuracy, relevance, and quality in AI outputs.

Definition

LLM evaluation metrics are quantitative and qualitative measures used to assess the quality, accuracy, and usefulness of large language model outputs.

Types of Evaluation Metrics

Common metric categories:

1. **Accuracy metrics**: Correctness of factual claims
2. **Relevance metrics**: How well outputs match the query
3. **Coherence metrics**: Logical flow and readability
4. **Safety metrics**: Harmful content detection
5. **Format metrics**: Adherence to output structure
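One way to make these categories concrete is to record a score per category for each output and roll them up into a single number. The sketch below is a hypothetical structure, not a standard: the field names, the 0-to-1 scale, and the weighting scheme are all assumptions you would adapt to your own evaluation setup.

```python
from dataclasses import dataclass

@dataclass
class EvaluationResult:
    """Per-category scores for a single model output, each in [0, 1]."""
    accuracy: float    # correctness of factual claims
    relevance: float   # how well the output matches the query
    coherence: float   # logical flow and readability
    safety: float      # 1.0 means no harmful content detected
    format_ok: float   # adherence to the required output structure

    def overall(self, weights: dict[str, float] | None = None) -> float:
        """Weighted average across categories; equal weights by default."""
        scores = {
            "accuracy": self.accuracy,
            "relevance": self.relevance,
            "coherence": self.coherence,
            "safety": self.safety,
            "format_ok": self.format_ok,
        }
        weights = weights or {name: 1.0 for name in scores}
        total = sum(weights.values())
        return sum(scores[k] * weights.get(k, 0.0) for k in scores) / total
```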

Automated Metrics

Machine-computed metrics:

- **BLEU/ROUGE**: N-gram overlap with reference outputs
- **Perplexity**: How well the model predicts the text (lower is better)
- **Semantic similarity**: Embedding-based comparison of meaning
- **Exact match**: Does the output match the reference string exactly?
- **F1 Score**: Balance of precision and recall, often computed over tokens
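Exact match and token-level F1 are the simplest of these to compute and need no external libraries. Below is a minimal sketch of one common way to implement them; the normalization choices (lowercasing, whitespace tokenization) are assumptions, and metrics like BLEU/ROUGE or embedding similarity usually come from libraries such as sacrebleu or sentence-transformers rather than hand-rolled code.

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: balances precision and recall of overlapping tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example usage
print(exact_match("Paris", "paris"))  # 1.0
print(token_f1("The capital is Paris", "Paris is the capital of France"))  # 0.8
```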

Human Evaluation

Human judgment metrics:

- **Helpfulness ratings**: Did the response help the user?
- **Preference comparisons**: A vs. B testing
- **Error annotation**: Flagging specific issues
- **Quality rubrics**: Structured scoring guides
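Preference comparisons are usually summarized as win rates per system. The sketch below shows one simple aggregation; splitting ties evenly is an assumption, and more rigorous setups fit Bradley-Terry or Elo-style models instead.

```python
from collections import Counter

def win_rates(preferences: list[str]) -> dict[str, float]:
    """Aggregate pairwise A-vs-B judgments into win rates per system.

    Each element of `preferences` is the label of the preferred response
    ("A" or "B") or "tie" when the annotator had no preference.
    """
    counts = Counter(preferences)
    total = len(preferences)
    # Ties are split evenly between the two systems.
    return {
        "A": (counts["A"] + 0.5 * counts["tie"]) / total,
        "B": (counts["B"] + 0.5 * counts["tie"]) / total,
    }

# Example: 100 annotator judgments comparing two prompt variants
judgments = ["A"] * 55 + ["B"] * 35 + ["tie"] * 10
print(win_rates(judgments))  # {'A': 0.6, 'B': 0.4}
```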

Custom Metrics for Your Use Case

Build metrics specific to your needs:

- Citation accuracy for RAG systems
- Code correctness for coding assistants
- Tone consistency for customer support
- Safety scores for sensitive applications
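A citation-accuracy check for a RAG system, for example, can be as simple as verifying that every source ID cited in the response was actually retrieved. The sketch below assumes a bracketed-ID citation format like "[doc1]"; the regex pattern and the choice of how to score responses with no citations are assumptions to tailor to your pipeline.

```python
import re

def citation_accuracy(response: str, retrieved_ids: set[str]) -> float:
    """Fraction of citations in the response that refer to a retrieved source.

    Assumes citations appear as bracketed IDs like "[doc3]"; adjust the
    pattern to whatever citation format your system actually emits.
    """
    cited = re.findall(r"\[([^\]]+)\]", response)
    if not cited:
        return 0.0  # or skip/flag, depending on whether citations are required
    valid = sum(1 for doc_id in cited if doc_id in retrieved_ids)
    return valid / len(cited)

# Example usage
response = "Revenue grew 12% in 2023 [doc1], driven by new markets [doc7]."
print(citation_accuracy(response, retrieved_ids={"doc1", "doc2", "doc3"}))  # 0.5
```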

Put This Knowledge Into Practice

Use PromptLens to implement professional prompt testing in your workflow.
