Context Sourcing

Evaluate how well the system retrieves and provides the right context for generating answers.

RAG Sourcing

When evidence comes from vector/keyword indices, measure retrieval quality with two complementary metrics:

Metric	What It Measures	Why It Matters
Recall	Of all the chunks that should have been retrieved, how many were actually found?	Low recall = the LLM is missing critical information it needs to answer correctly
Precision	Of all the chunks the system did retrieve, how many are actually relevant?	Low precision = noise and off-topic chunks waste the context window and can confuse the LLM

Manual Examples:

Query	Ground Truth (3 chunks)	Retrieved (2 chunks)	Recall	Precision	Verdict
What are the health insurance benefits?	① Eligible after 90 days ② Company covers 80% of premium ③ Dental/vision available for $45/mo	① Eligible after 90 days + ❌ Cafeteria hours (off-topic)	1/3 = 0.33	1/2 = 0.50	❌ FAIL — missed 2 of 3 expected chunks and retrieved noise
How do I request annual leave?	① Submit through self-service portal ② Manager approval 5 days in advance	① Submit through portal ② Manager approval 5 days in advance	2/2 = 1.0	2/2 = 1.0	✅ PASS — all expected chunks found, no noise
How do I request annual leave?	① Submit through self-service portal ② Manager approval 5 days in advance	① Submit through portal ② Manager approval + ❌ Company founding date + ❌ Parking info	2/2 = 1.0	2/4 = 0.50	❌ FAIL — all info found but half the retrieved chunks are noise

Code: examples/accuracy/context_sourcing/rag_retrieval_evaluator.py

from examples.accuracy.context_sourcing.rag_retrieval_evaluator import (
    evaluate_recall,
    evaluate_precision,
    evaluate_retrieval,   # runs both in one call
)

# Recall only — did we find all the expected chunks?
recall_result = evaluate_recall(
    retrieved_chunks=["Eligible after 90 days...", "Cafeteria hours..."],
    ground_truth_chunks=["Eligible after 90 days...", "Company covers 80%...", "Dental/vision for $45..."]
)
print(recall_result)  # {"passed": False, "recall": 0.333, "missing_chunks": [...]}

# Precision only — are the retrieved chunks relevant?
precision_result = evaluate_precision(
    retrieved_chunks=["Submit through portal...", "Manager approval...", "Company founded in 2003...", "Parking info..."],
    ground_truth_chunks=["Submit through portal...", "Manager approval..."]
)
print(precision_result)  # {"passed": False, "precision": 0.5, "noise_chunks": [...]}

# Combined — both recall and precision in one call
result = evaluate_retrieval(
    retrieved_chunks=["Submit through portal...", "Manager approval..."],
    ground_truth_chunks=["Submit through portal...", "Manager approval..."]
)
print(result)  # {"passed": True, "recall": 1.0, "precision": 1.0}

Non-RAG Sourcing (Tools / DB / API)

Not all systems use RAG. Many fetch data from APIs, databases, or programmatic tools instead. The exact failure modes and evaluation approach depend on your architecture, but three common areas to evaluate are:

Area	What It Evaluates	Why It Matters
API Selection	Did the system pick the correct endpoint?	Wrong endpoint = entirely wrong data, no downstream fix possible
Parameter Accuracy	Did it pass the correct query params / filters?	Right endpoint + wrong parameters = wrong results (e.g., wrong date, wrong passenger count)
Query Generation	Did it produce a correct database query (SQL/NoSQL)?	Wrong JOINs, missing WHERE clauses, or bad aggregations silently return incorrect data

Note: These are starting examples. Non-RAG sourcing can vary widely — GraphQL queries, gRPC calls, multi-step tool chains, etc. For simple cases, API Selection and Parameter checks are straightforward ground-truth comparisons (no LLM needed). For more complex systems with ambiguous routing logic, multiple valid endpoints, or dynamic schemas, you can substitute an LLM-as-judge approach instead.

Manual Examples — API Selection:

Query	Selected	Ground Truth	Result
Find flights from Karachi to Dubai next Friday.	flight_search	flight_search	✅ PASS — correct endpoint
Find flights from Karachi to Dubai next Friday.	weather	flight_search	❌ FAIL — wrong endpoint entirely

Manual Examples — Parameter Accuracy:

Query	Generated Params	Ground Truth Params	Result
Economy flights KHI→DXB on March 28 for 2 passengers.	`{origin: KHI, dest: DXB, date: 2026-03-28, pax: 2, class: economy}`	Same	✅ PASS — all params correct
Flights LHE→IST for 3 passengers on April 5.	`{origin: LHE, dest: IST, date: 2026-04-05, pax: 1, class: business}`	`{origin: LHE, dest: IST, date: 2026-04-05, pax: 3}`	❌ FAIL — wrong passenger count, invented cabin class

Manual Examples — Query Generation:

Query	Generated SQL	Result
Show all pending orders from customers in Pakistan.	`SELECT ... FROM orders o JOIN customers c ... WHERE c.country = 'Pakistan' AND o.status = 'pending'`	✅ PASS — both filters present
Show all pending orders from customers in Pakistan.	`SELECT ... FROM orders o JOIN customers c ... WHERE c.country = 'Pakistan'`	❌ FAIL — missing status filter, returns all orders instead of just pending

Code: examples/accuracy/context_sourcing/non_rag_sourcing_evaluator.py

from examples.accuracy.context_sourcing.non_rag_sourcing_evaluator import (
    evaluate_api_selection,
    evaluate_parameter_accuracy,
    evaluate_query_generation,
)

# 1. API Selection — simple ground-truth comparison (no LLM needed)
result = evaluate_api_selection(
    selected_api="flight_search",
    ground_truth_api="flight_search",
)
# {"passed": True, "reason": "Correct endpoint selected: 'flight_search'."}

# 2. Parameter Accuracy — simple dict comparison (no LLM needed)
result = evaluate_parameter_accuracy(
    generated_params={"origin": "KHI", "destination": "DXB", "departure_date": "2026-03-28", "passengers": 2},
    ground_truth_params={"origin": "KHI", "destination": "DXB", "departure_date": "2026-03-28", "passengers": 2},
)
# {"passed": True, "reason": "All parameters match the ground truth.", "issues": []}

# 3. Query Generation — LLM-as-judge (user query + generated SQL)
result = evaluate_query_generation(
    judge=judge,
    query="Show all pending orders from customers in Pakistan",
    generated_query="SELECT o.order_id, o.total_amount FROM orders o JOIN customers c ON o.customer_id = c.customer_id WHERE c.country = 'Pakistan' AND o.status = 'pending'",
)
# {"passed": True, "reason": "...", "issues": []}