Memory

Evaluate whether the system correctly stores, recalls, and updates information across conversations.

Metric What It Evaluates What Goes Wrong
Memory Correctness Did the system store exactly what the user said? Wrong value stored, hallucinated details added, key information missing
Recall Relevance When the system retrieves memory, is it the right memory at the right time? Irrelevant memories surface, useful memories ignored
Update Correctness When the user corrects something, does memory update properly? Old value kept alongside new, update ignored, unrelated memory deleted

Memory Correctness

Manual Examples:

User Input Stored Memory Expected Result
“My favorite color is blue.” favorite_color = blue favorite_color = blue ✅ PASS — stored exactly what user said
“My favorite color is blue.” favorite_color = green favorite_color = blue ❌ FAIL — wrong value stored
“My project deadline is May 5.” deadline = May 5, 2024 at 5:00 PM deadline = May 5 ❌ FAIL — hallucinated time and year user never said
“I have 2 kids, ages 5 and 8.” kids_count = 2 (ages missing) kids_count = 2, kids_ages = [5, 8] ❌ FAIL — missing key information

Recall Relevance

Manual Examples:

Current Query Retrieved Memory Full Memory Store Result
“Suggest outfits I might like.” favorite_color = blue favorite_color, favorite_food, city ✅ PASS — relevant memory retrieved
“Suggest outfits I might like.” favorite_food = pizza favorite_color, favorite_food, city ❌ FAIL — irrelevant memory; missed favorite_color

Update Correctness

Manual Examples:

User Says Memory Before Memory After Result
“Change my favorite color to red.” {favorite_color: blue, city: Karachi} {favorite_color: red, city: Karachi} ✅ PASS — updated correctly, unrelated key preserved
“Change my favorite color to red.” {favorite_color: blue, city: Karachi} {favorite_color: blue, favorite_color_new: red, city: Karachi} ❌ FAIL — kept both old and new
“Change my favorite color to red.” {favorite_color: blue, city: Karachi} {favorite_color: blue, city: Karachi} ❌ FAIL — ignored the update entirely
“Change my favorite color to red.” {favorite_color: blue, city: Karachi} {favorite_color: red} ❌ FAIL — updated correctly but deleted unrelated city

Code

Code: examples/accuracy/memory/memory_context_evaluator.py

from examples.accuracy.memory.memory_context_evaluator import (
    evaluate_memory_correctness,
    evaluate_recall_relevance,
    evaluate_update_correctness,
)

# 1. Memory Correctness — LLM-as-judge
result = evaluate_memory_correctness(
    judge=judge,
    user_input="My favorite color is blue.",
    stored_memory={"favorite_color": "green"},
    expected_memory={"favorite_color": "blue"},
)
# {"passed": False, "reason": "...", "issues": ["Wrong value for 'favorite_color': ..."]}

# 2. Recall Relevance — LLM-as-judge
result = evaluate_recall_relevance(
    judge=judge,
    query="Suggest outfits I might like.",
    retrieved_memories=[{"key": "favorite_color", "value": "blue"}],
    all_memories=[
        {"key": "favorite_color", "value": "blue"},
        {"key": "favorite_food", "value": "pizza"},
        {"key": "city", "value": "Karachi"},
    ],
)
# {"passed": True, "reason": "...", "relevant_memories": [...], "irrelevant_memories": [], "missed_memories": []}

# 3. Update Correctness — LLM-as-judge
result = evaluate_update_correctness(
    judge=judge,
    memory_before={"favorite_color": "blue", "city": "Karachi"},
    update_instruction="Actually, change my favorite color to red.",
    memory_after={"favorite_color": "red", "city": "Karachi"},
    expected_memory_after={"favorite_color": "red", "city": "Karachi"},
)
# {"passed": True, "reason": "...", "issues": []}

Back to top

Copyright © 2026 Emumba. Distributed under the MIT License.

This site uses Just the Docs, a documentation theme for Jekyll.