Context Sourcing

Evaluate how well the system retrieves and provides the right context for generating answers.


RAG Sourcing

When evidence comes from vector/keyword indices, measure retrieval quality with two complementary metrics:

| Metric | What It Measures | Why It Matters |
|--------|------------------|----------------|
| Recall | Of all the chunks that should have been retrieved, how many were actually found? | Low recall = the LLM is missing critical information it needs to answer correctly |
| Precision | Of all the chunks the system did retrieve, how many are actually relevant? | Low precision = noise and off-topic chunks waste the context window and can confuse the LLM |

Manual Examples:

| Query | Ground Truth | Retrieved | Recall | Precision | Verdict |
|-------|--------------|-----------|--------|-----------|---------|
| What are the health insurance benefits? | ① Eligible after 90 days ② Company covers 80% of premium ③ Dental/vision available for $45/mo | ① Eligible after 90 days + ❌ Cafeteria hours (off-topic) | 1/3 = 0.33 | 1/2 = 0.50 | ❌ FAIL — missed 2 of 3 expected chunks and retrieved noise |
| How do I request annual leave? | ① Submit through self-service portal ② Manager approval 5 days in advance | ① Submit through portal ② Manager approval 5 days in advance | 2/2 = 1.0 | 2/2 = 1.0 | ✅ PASS — all expected chunks found, no noise |
| How do I request annual leave? | ① Submit through self-service portal ② Manager approval 5 days in advance | ① Submit through portal ② Manager approval + ❌ Company founding date + ❌ Parking info | 2/2 = 1.0 | 2/4 = 0.50 | ❌ FAIL — all info found but half the retrieved chunks are noise |
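The arithmetic behind these verdicts is simple set overlap. Here is a minimal sketch that reproduces the first row's numbers; the function names and chunk identifiers are illustrative, not the actual evaluator API:

```python
def compute_recall(retrieved: set, ground_truth: set) -> float:
    """Fraction of expected chunks that were actually retrieved."""
    if not ground_truth:
        return 1.0
    return len(retrieved & ground_truth) / len(ground_truth)

def compute_precision(retrieved: set, ground_truth: set) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & ground_truth) / len(retrieved)

# First row: 3 expected chunks, retrieval found 1 of them plus 1 noise chunk.
truth = {"eligible-90-days", "covers-80-percent", "dental-vision-45"}
got = {"eligible-90-days", "cafeteria-hours"}

print(round(compute_recall(got, truth), 2))   # 0.33
print(compute_precision(got, truth))          # 0.5
```

Real evaluators usually need fuzzy or embedding-based matching instead of exact set membership, since retrieved chunks rarely match ground truth strings verbatim.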

Code: `examples/accuracy/context_sourcing/rag_retrieval_evaluator.py`

```python
from examples.accuracy.context_sourcing.rag_retrieval_evaluator import (
    evaluate_recall,
    evaluate_precision,
    evaluate_retrieval,   # runs both in one call
)

# Recall only — did we find all the expected chunks?
recall_result = evaluate_recall(
    retrieved_chunks=["Eligible after 90 days...", "Cafeteria hours..."],
    ground_truth_chunks=["Eligible after 90 days...", "Company covers 80%...", "Dental/vision for $45..."]
)
print(recall_result)  # {"passed": False, "recall": 0.333, "missing_chunks": [...]}

# Precision only — are the retrieved chunks relevant?
precision_result = evaluate_precision(
    retrieved_chunks=["Submit through portal...", "Manager approval...", "Company founded in 2003...", "Parking info..."],
    ground_truth_chunks=["Submit through portal...", "Manager approval..."]
)
print(precision_result)  # {"passed": False, "precision": 0.5, "noise_chunks": [...]}

# Combined — both recall and precision in one call
result = evaluate_retrieval(
    retrieved_chunks=["Submit through portal...", "Manager approval..."],
    ground_truth_chunks=["Submit through portal...", "Manager approval..."]
)
print(result)  # {"passed": True, "recall": 1.0, "precision": 1.0}
```

Non-RAG Sourcing (Tools / DB / API)

Not all systems use RAG. Many fetch data from APIs, databases, or programmatic tools instead. The exact failure modes and evaluation approach depend on your architecture, but three common areas to evaluate are:

| Area | What It Evaluates | Why It Matters |
|------|-------------------|----------------|
| API Selection | Did the system pick the correct endpoint? | Wrong endpoint = entirely wrong data, no downstream fix possible |
| Parameter Accuracy | Did it pass the correct query params / filters? | Right endpoint + wrong parameters = wrong results (e.g., wrong date, wrong passenger count) |
| Query Generation | Did it produce a correct database query (SQL/NoSQL)? | Wrong JOINs, missing WHERE clauses, or bad aggregations silently return incorrect data |

Note: These are starting examples. Non-RAG sourcing can vary widely — GraphQL queries, gRPC calls, multi-step tool chains, etc. For simple cases, API Selection and Parameter checks are straightforward ground-truth comparisons (no LLM needed). For more complex systems with ambiguous routing logic, multiple valid endpoints, or dynamic schemas, you can substitute an LLM-as-judge approach instead.

Manual Examples — API Selection:

| Query | Selected | Ground Truth | Result |
|-------|----------|--------------|--------|
| Find flights from Karachi to Dubai next Friday. | `flight_search` | `flight_search` | ✅ PASS — correct endpoint |
| Find flights from Karachi to Dubai next Friday. | `weather` | `flight_search` | ❌ FAIL — wrong endpoint entirely |
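For a fixed set of endpoints, this check reduces to an exact comparison. A minimal sketch (the function name and result shape are illustrative, not the actual module):

```python
def check_api_selection(selected: str, ground_truth: str) -> dict:
    """Exact-match endpoint check; no LLM needed."""
    passed = selected == ground_truth
    reason = (
        f"Correct endpoint selected: '{selected}'."
        if passed
        else f"Wrong endpoint: expected '{ground_truth}', got '{selected}'."
    )
    return {"passed": passed, "reason": reason}

print(check_api_selection("weather", "flight_search"))
# {'passed': False, 'reason': "Wrong endpoint: expected 'flight_search', got 'weather'."}
```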

Manual Examples — Parameter Accuracy:

| Query | Generated Params | Ground Truth Params | Result |
|-------|------------------|---------------------|--------|
| Economy flights KHI→DXB on March 28 for 2 passengers. | `{origin: KHI, dest: DXB, date: 2026-03-28, pax: 2, class: economy}` | Same | ✅ PASS — all params correct |
| Flights LHE→IST for 3 passengers on April 5. | `{origin: LHE, dest: IST, date: 2026-04-05, pax: 1, class: business}` | `{origin: LHE, dest: IST, date: 2026-04-05, pax: 3}` | ❌ FAIL — wrong passenger count, invented cabin class |
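The failing row above can be caught with a key-by-key dict comparison that flags three distinct problems: wrong values, missing params, and invented params. A minimal sketch under those assumptions (illustrative names, not the actual module):

```python
def check_params(generated: dict, ground_truth: dict) -> dict:
    """Compare generated params to ground truth; collect every mismatch."""
    issues = []
    for key, expected in ground_truth.items():
        if key not in generated:
            issues.append(f"missing param '{key}'")
        elif generated[key] != expected:
            issues.append(
                f"wrong value for '{key}': expected {expected!r}, got {generated[key]!r}"
            )
    for key in generated:
        if key not in ground_truth:
            issues.append(f"unexpected param '{key}' (not in ground truth)")
    return {"passed": not issues, "issues": issues}

# Second row: wrong passenger count + invented cabin class.
result = check_params(
    generated={"origin": "LHE", "dest": "IST", "date": "2026-04-05", "pax": 1, "class": "business"},
    ground_truth={"origin": "LHE", "dest": "IST", "date": "2026-04-05", "pax": 3},
)
print(result["passed"])  # False
print(result["issues"])  # wrong 'pax' value, unexpected 'class'
```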

Manual Examples — Query Generation:

| Query | Generated SQL | Result |
|-------|---------------|--------|
| Show all pending orders from customers in Pakistan. | `SELECT ... FROM orders o JOIN customers c ... WHERE c.country = 'Pakistan' AND o.status = 'pending'` | ✅ PASS — both filters present |
| Show all pending orders from customers in Pakistan. | `SELECT ... FROM orders o JOIN customers c ... WHERE c.country = 'Pakistan'` | ❌ FAIL — missing status filter, returns all orders instead of just pending |
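When the required filters are known in advance, a cheap deterministic pre-check can catch the missing-filter failure above before spending an LLM-as-judge call. This is a simplistic substring heuristic for illustration only (it would miss semantically equivalent rewrites like `status IN ('pending')`), which is exactly why the evaluator falls back to an LLM judge for the general case:

```python
def missing_filters(generated_sql: str, required_filters: list[str]) -> list[str]:
    """Return the required WHERE-clause fragments absent from the generated SQL."""
    sql = " ".join(generated_sql.lower().split())  # normalize case and whitespace
    return [f for f in required_filters if f.lower() not in sql]

required = ["c.country = 'pakistan'", "o.status = 'pending'"]

good = ("SELECT o.order_id FROM orders o JOIN customers c "
        "ON o.customer_id = c.customer_id "
        "WHERE c.country = 'Pakistan' AND o.status = 'pending'")
bad = ("SELECT o.order_id FROM orders o JOIN customers c "
       "ON o.customer_id = c.customer_id WHERE c.country = 'Pakistan'")

print(missing_filters(good, required))  # []
print(missing_filters(bad, required))   # ["o.status = 'pending'"]
```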

Code: `examples/accuracy/context_sourcing/non_rag_sourcing_evaluator.py`

```python
from examples.accuracy.context_sourcing.non_rag_sourcing_evaluator import (
    evaluate_api_selection,
    evaluate_parameter_accuracy,
    evaluate_query_generation,
)

# 1. API Selection — simple ground-truth comparison (no LLM needed)
result = evaluate_api_selection(
    selected_api="flight_search",
    ground_truth_api="flight_search",
)
# {"passed": True, "reason": "Correct endpoint selected: 'flight_search'."}

# 2. Parameter Accuracy — simple dict comparison (no LLM needed)
result = evaluate_parameter_accuracy(
    generated_params={"origin": "KHI", "destination": "DXB", "departure_date": "2026-03-28", "passengers": 2},
    ground_truth_params={"origin": "KHI", "destination": "DXB", "departure_date": "2026-03-28", "passengers": 2},
)
# {"passed": True, "reason": "All parameters match the ground truth.", "issues": []}

# 3. Query Generation — LLM-as-judge (user query + generated SQL)
# `judge` is an LLM judge client constructed elsewhere before this call.
result = evaluate_query_generation(
    judge=judge,
    query="Show all pending orders from customers in Pakistan",
    generated_query="SELECT o.order_id, o.total_amount FROM orders o JOIN customers c ON o.customer_id = c.customer_id WHERE c.country = 'Pakistan' AND o.status = 'pending'",
)
# {"passed": True, "reason": "...", "issues": []}
```

Copyright © 2026 Emumba. Distributed under the MIT License.