1. Evaluation Strategy
Before looking into specific evaluation areas, it’s important to first have a high-level strategy. Once the goal is clear, you can work and produce quality results in a more structured manner. This section covers how an end-to-end evaluation plan should look — the approach you should follow to test the credibility of your Gen AI application.
The following are the key areas we cover:
- Evaluation Methods — Manual vs. automated, and which automated approach to pick
- Test Set Design — How to build a test set that covers your application’s full surface area
- Dataset Generation — Where the test cases come from
- Evaluation Design — How to turn evaluation results into actionable decisions
Evaluation Methods
There are two broad categories: manual (human review) and automated (rule-based or LLM-as-judge).
Manual — Human Review
Manual evaluation cannot be fully replaced. A part of review will always be required, even if you are using automated approaches — a human feedback loop is necessary. The person reviewing the LLM output should be well-versed in what they are building.
The downside is that manual QA takes significantly more time and in many cases is simply not feasible. Consider a RAG application with 10,000 documents — a manual reviewer, or even a team of them, won’t be able to fully cover it.
| Method | When to Use | Pros | Cons |
|---|---|---|---|
| Human Review | High-stakes decisions, edge cases, calibrating automated judges | Gold-standard quality signal; catches subtle issues machines miss | Expensive, slow, doesn’t scale; inter-rater disagreement |
Automated
To scale evaluation, we need automated approaches. There are two major ones:
Rule-Based
Rule-based methods use deterministic checks like cosine similarity, regex matching, keyword presence, or schema validation. For example, you compare the LLM output against a ground truth using embedding similarity and threshold it — if the score is above the threshold, it passes; otherwise it fails.
The problem is that the threshold can be unreliable. Because LLM output is non-deterministic, even if the response is accurate, it can still have a low similarity score if the wording is different from the ground truth. And even if you do get results, you’d still need a complete manual review to verify them — there’s no reasoning attached to explain why something passed or failed.
| Method | When to Use | Pros | Cons |
|---|---|---|---|
| Rule-Based | Deterministic checks — format, schema, keyword presence, regex, similarity | Fast, cheap, reproducible, zero ambiguity | Can’t handle open-ended quality; threshold is fragile with non-deterministic outputs; no reasoning |
LLM-as-Judge
A better approach to test an LLM application is to use an LLM itself as the evaluator. Why? Because it can do verifications based on your provided rules, and it won’t just look at things word-by-word — it evaluates by meaning. This is critical because the response can be completely correct but worded entirely differently from the ground truth. On top of that, the LLM judge can generate reasoning for why it made a certain decision, which helps enormously with debugging.
This approach is scalable. That said, using an LLM as a judge comes at a cost — each evaluation call is an LLM API call.
Throughout this evaluation guide, we have included many prompt templates that you can use for your own use cases.
For a detailed deep-dive on LLM-as-Judge — model selection, prompt design, calibration, bias mitigation, and cost control — refer to: Why Use LLM as a Judge: A Complete Guide for Software Engineers
| Method | When to Use | Pros | Cons |
|---|---|---|---|
| LLM as a Judge | Subjective quality, tone, coherence, faithfulness, any open-ended assessment | Scales to thousands of cases; evaluates by meaning not just words; generates reasoning | Requires calibration; judge model bias; cost per call |
Code: This repo includes a shared LLM judge that supports OpenAI, Anthropic, and Groq providers:
from examples.common.llm_judge import LLMJudge
judge = LLMJudge(provider="openai", model="gpt-4o", api_key="your-api-key")
result = judge.judge("Rate the factual accuracy of this response: ...")
# result: {"score": 1, "reason": "All claims are supported by the ground truth."}
Test Set Design
A good test set is the backbone of reliable evaluation. It must cover the full surface area of your application.
Functional Coverage
| Dimension | Description | Example |
|---|---|---|
| Representative Scenarios (per capability) | At least one test case per feature/capability your app supports | RAG app → one case per document type; API app → one case per endpoint |
| Variants / Corners | Vary difficulty, modality, persona, and prompt style to stress-test generalization | Easy vs. hard queries; formal vs. casual tone; short vs. long context |
| Negative Controls | Queries the app should refuse — verify graceful refusal | Out-of-scope questions; adversarial prompts; gibberish input |
Robustness & Safety Suite
| Dimension | Description | Example |
|---|---|---|
| Edge Cases | Boundary conditions, empty inputs, extremely long context, ambiguous queries | Empty query; 100K-token context; query with no clear intent |
| Stress Tests | High-volume, concurrent, or resource-intensive scenarios | 50 parallel requests; context window near max capacity |
| Adversarial Cases | Malicious attempts to break the system | Prompt injection; jailbreaking; data exfiltration attempts |
Tip: Store test cases as JSONL files — one JSON object per line. Easy to version, diff, and load.
Dataset Generation
Where do the test cases come from? Use a mix of sources for coverage and realism.
| Source | Description | Best For |
|---|---|---|
| Golden Dataset | Hand-curated, expert-verified input-output pairs | Baseline accuracy measurement; regression testing |
| Human-Labeled Data | Real or synthetic inputs annotated by human reviewers | Calibrating LLM judges; subjective quality dimensions |
| Synthetic Data | LLM-generated test cases (with human spot-checks) | Scaling coverage quickly; generating edge cases and variants |
| Real User Data | Anonymized production logs and user interactions | Ensuring evaluation reflects actual usage patterns |
Tip: Start with a small golden dataset (50–100 cases), expand with synthetic data, and continuously enrich with anonymized real user data as the app matures.
Evaluation Design
How do you turn raw evaluation signals into actionable decisions?
| Approach | Description | When to Use |
|---|---|---|
| Pass / Fail | Binary threshold — the test case either passes or fails | Deterministic checks (format, schema, security); safety guardrails |
| Scoring | Numeric scale (e.g., 1–5, 0–100%) with defined rubrics | Quality dimensions (faithfulness, coherence, completeness) |
| Human Review | Qualitative assessment by a human reviewer | Edge cases; calibrating automated scoring; high-stakes decisions |
Example — What an Evaluation Looks Like (RAG)
Here’s a concrete example of what an evaluation result looks like for a RAG application using LLM-as-judge:
| Query | Ground Truth | Actual Response | Pass/Fail | Reasoning |
|---|---|---|---|---|
| What are Python decorators and how do they work? | Python decorators are functions that modify the behavior of other functions using the @syntax. They are used for logging, authentication, and caching. Built-in decorators include @staticmethod, @classmethod, and @property. | Python decorators use the @syntax to modify function behavior. They are commonly used for logging, authentication, and caching. Examples of built-in decorators are @staticmethod, @classmethod, and @property. | ✅ PASS | All claims in the response are supported by the ground truth. The wording differs but the meaning is equivalent. |
| How do database indexes improve query performance? | Database indexes speed up data retrieval by avoiding full table scans. B-tree indexes are the most common type used for range queries and exact matches. | Database indexes speed up data retrieval by avoiding full table scans. B-tree indexes are commonly used. Indexes also add write overhead because they must be updated on every insert, update, or delete. | ❌ FAIL | The response introduces a claim about write overhead that is not present in the ground truth. Two of three claims are supported, but the unsupported claim makes this a failure. |
This is the kind of output every evaluator in this guide produces — a clear pass/fail verdict with reasoning you can review and debug.
← Home: Gen AI Applications Evaluation Guidelines · Next: 2. Accuracy →