Memory

Evaluate whether the system correctly stores, recalls, and updates information across conversations.

Metric	What It Evaluates	What Goes Wrong
Memory Correctness	Did the system store exactly what the user said?	Wrong value stored, hallucinated details added, key information missing
Recall Relevance	When the system retrieves memory, is it the right memory at the right time?	Irrelevant memories surface, useful memories ignored
Update Correctness	When the user corrects something, does memory update properly?	Old value kept alongside new, update ignored, unrelated memory deleted

Memory Correctness

Manual Examples:

User Input	Stored Memory	Expected	Result
“My favorite color is blue.”	`favorite_color = blue`	`favorite_color = blue`	✅ PASS — stored exactly what user said
“My favorite color is blue.”	`favorite_color = green`	`favorite_color = blue`	❌ FAIL — wrong value stored
“My project deadline is May 5.”	`deadline = May 5, 2024 at 5:00 PM`	`deadline = May 5`	❌ FAIL — hallucinated time and year user never said
“I have 2 kids, ages 5 and 8.”	`kids_count = 2` (ages missing)	`kids_count = 2, kids_ages = [5, 8]`	❌ FAIL — missing key information

Recall Relevance

Manual Examples:

Current Query	Retrieved Memory	Full Memory Store	Result
“Suggest outfits I might like.”	`favorite_color = blue`	favorite_color, favorite_food, city	✅ PASS — relevant memory retrieved
“Suggest outfits I might like.”	`favorite_food = pizza`	favorite_color, favorite_food, city	❌ FAIL — irrelevant memory; missed favorite_color

Update Correctness

Manual Examples:

User Says	Memory Before	Memory After	Result
“Change my favorite color to red.”	`{favorite_color: blue, city: Karachi}`	`{favorite_color: red, city: Karachi}`	✅ PASS — updated correctly, unrelated key preserved
“Change my favorite color to red.”	`{favorite_color: blue, city: Karachi}`	`{favorite_color: blue, favorite_color_new: red, city: Karachi}`	❌ FAIL — kept both old and new
“Change my favorite color to red.”	`{favorite_color: blue, city: Karachi}`	`{favorite_color: blue, city: Karachi}`	❌ FAIL — ignored the update entirely
“Change my favorite color to red.”	`{favorite_color: blue, city: Karachi}`	`{favorite_color: red}`	❌ FAIL — updated correctly but deleted unrelated `city`

Code

Code: examples/accuracy/memory/memory_context_evaluator.py

from examples.accuracy.memory.memory_context_evaluator import (
    evaluate_memory_correctness,
    evaluate_recall_relevance,
    evaluate_update_correctness,
)

# 1. Memory Correctness — LLM-as-judge
result = evaluate_memory_correctness(
    judge=judge,
    user_input="My favorite color is blue.",
    stored_memory={"favorite_color": "green"},
    expected_memory={"favorite_color": "blue"},
)
# {"passed": False, "reason": "...", "issues": ["Wrong value for 'favorite_color': ..."]}

# 2. Recall Relevance — LLM-as-judge
result = evaluate_recall_relevance(
    judge=judge,
    query="Suggest outfits I might like.",
    retrieved_memories=[{"key": "favorite_color", "value": "blue"}],
    all_memories=[
        {"key": "favorite_color", "value": "blue"},
        {"key": "favorite_food", "value": "pizza"},
        {"key": "city", "value": "Karachi"},
    ],
)
# {"passed": True, "reason": "...", "relevant_memories": [...], "irrelevant_memories": [], "missed_memories": []}

# 3. Update Correctness — LLM-as-judge
result = evaluate_update_correctness(
    judge=judge,
    memory_before={"favorite_color": "blue", "city": "Karachi"},
    update_instruction="Actually, change my favorite color to red.",
    memory_after={"favorite_color": "red", "city": "Karachi"},
    expected_memory_after={"favorite_color": "red", "city": "Karachi"},
)
# {"passed": True, "reason": "...", "issues": []}