# Response Quality
Evaluate the quality of the LLM’s generated output regardless of where the context came from.
## Task Quality
Does the response actually do what the user asked for? This evaluates intent fulfillment — not whether the answer is factually correct (that’s Factuality), but whether the response addresses every part of the user’s request with a concrete, actionable answer.
Manual Examples:
| Query | Generated Response | Result |
|---|---|---|
| Show me all orders placed by customers from Pakistan that are still pending. | SQL query with correct JOIN and WHERE clauses for country and status | ✅ PASS — response delivers the requested artifact with both conditions |
| Show me all orders placed by customers from Pakistan that are still pending. | SQL query filters by country but omits the status filter | ❌ FAIL — misses part of the user’s intent |
| Summarize last week’s sales and list the top 3 products by revenue. | Provides the sales summary but omits the top 3 products list | ❌ FAIL — only half the request is fulfilled |
| What is the refund policy for cancelled flights? | “Refund policies can vary. I’d recommend checking the airline’s website.” | ❌ FAIL — vague deflection, doesn’t answer the question |
Note: This evaluator does NOT require ground truth. It judges the response solely against the user’s query.
Code: `examples/accuracy/response_quality/task_quality_evaluator.py`

```python
from examples.accuracy.response_quality.task_quality_evaluator import evaluate_task_quality

result = evaluate_task_quality(
    judge=judge,
    query="Summarize last week's sales and list the top 3 products by revenue.",
    generated_response="Last week's total sales were $42,350 across 1,204 orders...",
)
print(result)  # {"passed": False, "reason": "...", "missed_intents": ["top 3 products by revenue"]}
```
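Every snippet in this section assumes a `judge` object that wraps the LLM used for grading. The repository's actual judge interface is not shown here, so the following is only a minimal sketch, assuming an OpenAI-style chat client and a JSON-verdict convention; the class name `Judge` and its `ask` method are hypothetical.

```python
import json

from openai import OpenAI  # assumption: any chat-completion client would do


class Judge:
    """Hypothetical minimal judge wrapper; the repo's real interface may differ."""

    def __init__(self, model: str = "gpt-4o"):
        self.client = OpenAI()
        self.model = model

    def ask(self, prompt: str) -> dict:
        # Send the grading prompt and parse the model's JSON verdict.
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        return json.loads(response.choices[0].message.content)


judge = Judge()
```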
## Instruction Following
Does the response follow the rules and constraints defined in the system prompt?
Manual Examples:
| System Prompt Rule | User Message | Response | Result |
|---|---|---|---|
| Never reveal whether a candidate’s answer is correct | “I would reverse the linked list iteratively.” | “Can you walk me through what happens to the pointers at each step?” | ✅ PASS — withholds judgment, asks follow-up |
| Never reveal whether a candidate’s answer is correct | “I would sort the array first.” | “That’s exactly right! Great job!” | ❌ FAIL — explicitly confirms correctness |
Code: `examples/accuracy/response_quality/instruction_following_evaluator.py`

```python
from examples.accuracy.response_quality.instruction_following_evaluator import evaluate_response_accuracy

result = evaluate_response_accuracy(
    judge=judge,
    system_prompt="Never reveal whether answers are correct...",
    user_message="I would reverse the linked list iteratively.",
    assistant_response="Can you walk me through your approach?",
)
```
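It can be useful to pair the passing case above with the table's failing case as a quick regression check. A sketch, assuming the result dict exposes a `passed` key like the other evaluators in this section:

```python
# The table's FAIL example: the assistant confirms the answer is correct.
violation = evaluate_response_accuracy(
    judge=judge,
    system_prompt="Never reveal whether answers are correct...",
    user_message="I would sort the array first.",
    assistant_response="That's exactly right! Great job!",
)
assert violation["passed"] is False  # assumed schema, mirroring the other evaluators
```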
## Factuality
Are the claims in the response factually correct when compared against a known ground-truth answer? This evaluator requires a ground truth; use Grounded Accuracy instead if you want to check claims against retrieved context.
Manual Examples:
| Response Claim | Ground Truth | Result |
|---|---|---|
| “B-tree indexes are used for range queries” | Ground truth confirms B-tree for range queries | ✅ PASS — factually supported |
| “Hash indexes are used for range queries” | Ground truth says B-tree for ranges, hash for equality | ❌ FAIL — factual error |
Code: `examples/accuracy/response_quality/factuality_evaluator.py`

```python
from examples.accuracy.response_quality.factuality_evaluator import evaluate_factuality

result = evaluate_factuality(
    judge=judge,
    query="How do database indexes improve query performance?",
    generated_response="Hash indexes are used for range queries...",
    ground_truth="B-tree indexes handle range queries; hash indexes handle equality...",
)
print(result)  # {"passed": False, "score": 0.5, "unsupported_claims": ["Hash indexes are used for range queries"]}
```
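The `unsupported_claims` field suggests the judge grades the response claim by claim against the ground truth. The prompt below is a hypothetical illustration of that decomposition (it reuses the sketched `judge.ask` from above), not the prompt the evaluator actually uses:

```python
# Hypothetical claim-level grading prompt; the evaluator's real prompt may differ.
FACTUALITY_PROMPT = """\
You are grading a response against a ground-truth answer.

Query: {query}
Response: {response}
Ground truth: {ground_truth}

Extract each factual claim in the response, mark it SUPPORTED or UNSUPPORTED
by the ground truth, and return JSON:
{{"passed": bool, "score": float, "unsupported_claims": [str]}}
"""

verdict = judge.ask(FACTUALITY_PROMPT.format(
    query="How do database indexes improve query performance?",
    response="Hash indexes are used for range queries...",
    ground_truth="B-tree indexes handle range queries; hash indexes handle equality...",
))
```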
## Consistency
Does the model produce consistent outputs for the same or similar queries? Repeating a query should not yield contradictory answers.
Manual Examples:
| Query | Response 1 | Response 2 | Result |
|---|---|---|---|
| What is a Python decorator? | “A function that wraps another function using @syntax” | “A function that modifies behavior of another function via @syntax” | ✅ PASS — consistent |
| Does Python support multi-threading for CPU tasks? | “Yes, threading is effective” | “No, the GIL prevents parallel execution” | ❌ FAIL — contradictory |
Code: `examples/accuracy/response_quality/consistency_evaluator.py`

```python
from examples.accuracy.response_quality.consistency_evaluator import evaluate_consistency

result = evaluate_consistency(
    judge=judge,
    query="What is a Python decorator?",
    responses=["A function that wraps...", "A function that modifies..."],
)
```
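In practice you first have to collect the candidate responses, typically by sampling the model under test several times (for example at a temperature above zero). A sketch, where `generate` is a hypothetical stand-in for however you call the system being evaluated:

```python
def generate(query: str) -> str:
    """Hypothetical wrapper around the model under test."""
    raise NotImplementedError("call your model under test here")


query = "What is a Python decorator?"
responses = [generate(query) for _ in range(3)]  # e.g. sample with temperature > 0

result = evaluate_consistency(judge=judge, query=query, responses=responses)
```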