3. Performance

Performance evaluation measures the non-functional characteristics of your LLM application — how fast, how cheap, how scalable, and how resilient it is under real-world conditions.

Latency

Response time is one of the most visible quality signals to end users. Slow responses degrade UX regardless of accuracy.

Metric	What It Measures	Why It Matters
Time to First Token (TTFT)	Time from request submission to the first token appearing in the response stream	Perceived responsiveness — users judge “speed” by when they see the first output, not when the full response completes
End-to-End Latency	Total time from request submission to delivery of the complete response	Total wait time and SLA compliance; critical for synchronous workflows
Throughput	Requests processed per second (RPS) under steady-state load	Capacity planning; determines how many concurrent users the system can serve
Streaming Stability	Consistency of token delivery rate during streamed responses	Choppy streaming (bursts then pauses) feels broken even if total latency is acceptable
Responsiveness	Time from user action (click, enter) to visible system acknowledgment	Includes network, queue, and pre-processing time — not just LLM inference

How to Measure

TTFT: Instrument the streaming callback and subtract the request timestamp from the timestamp of the first on_token event.
E2E Latency: Timestamp at request send and at response complete. For multi-step agents, sum the per-step latencies along the critical path (steps that run in parallel overlap rather than add) and include orchestration overhead.
Throughput: Run a controlled load test (see Load Testing below) and measure sustained RPS at acceptable latency percentiles (p50, p95, p99).
Streaming Stability: Measure inter-token intervals during streaming. Flag a response if the standard deviation of those intervals exceeds a threshold, or if any single gap is too long (e.g., > 500 ms).
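
The TTFT, E2E, and streaming-stability measurements above can be sketched in one small helper. This is a minimal illustration, not the repository's evaluator: measure_stream, its start_stream argument (a zero-argument callable that sends the request and returns an iterable of tokens), and the 0.5 s gap threshold are all assumptions made for the example.

```python
import statistics
import time
from typing import Callable, Iterable


def measure_stream(
    start_stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
    max_gap_s: float = 0.5,  # assumed threshold, mirrors the ">500 ms" example
) -> dict:
    """Time one streamed response: TTFT, E2E latency, and inter-token gaps."""
    start = clock()                  # request timestamp
    arrivals = []                    # arrival time of each token
    for _ in start_stream():         # the request is sent inside start_stream()
        arrivals.append(clock())
    end = clock()                    # stream fully consumed

    # Interval between each pair of consecutive tokens.
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return {
        "ttft_s": arrivals[0] - start if arrivals else None,
        "e2e_s": end - start,
        "gap_stdev_s": statistics.pstdev(gaps) if gaps else 0.0,
        "max_gap_s": max(gaps, default=0.0),
        "stalled": any(gap > max_gap_s for gap in gaps),
    }
```

To use it, wrap the streaming call of whichever client you have in a lambda, for example measure_stream(lambda: my_client_stream(prompt)), where my_client_stream is a placeholder for your own function. The clock parameter exists only so the timing logic can be checked with a fake clock.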

Tools

For end-to-end load and latency testing, industry-standard tools like JMeter and k6 are well-suited — they can simulate concurrent users, measure p50/p95/p99 latencies, and stress-test your full pipeline including retrieval, orchestration, and LLM inference.

For chunk-by-chunk streaming latency (TTFT, inter-token intervals, streaming stability), use streamapiperformance — an npm package purpose-built for measuring token-level timing in streamed LLM responses.

Code: examples/performance/latency_evaluator.py — a simple callable-based timing wrapper for measuring E2E latency of any LLM call.

Cost

At production scale, cost becomes one of the most critical performance dimensions. As user volume grows, every unnecessary token, redundant LLM call, and over-fetched context chunk compounds into significant spend. A single query in a multi-agent RAG pipeline can trigger 3–10+ LLM invocations — if you’re not tracking cost per journey, you’re flying blind.

Metric	What It Measures	Why It Matters
Per Model Call	Token cost (input + output) for a single LLM invocation	Baseline unit cost; varies dramatically between models (GPT-4o vs. GPT-4o-mini vs. open-source)
End-to-End Journey Cost	Total cost of resolving a user query, including all LLM calls, retrieval, and tool invocations	Multi-agent and RAG systems often make 3–10+ LLM calls per query; per-call cost alone is misleading
Token Usage Efficiency	Ratio of useful output tokens to total tokens consumed (including system prompts, retries, and context)	Bloated system prompts, unnecessary retries, and over-fetched context silently inflate costs

How to Measure

Per Model Call: Log prompt_tokens and completion_tokens from the API response. Multiply each by the model’s per-token price; input and output tokens are usually priced differently.
Journey Cost: Sum all per-call costs across the full agent/RAG pipeline for a single user query. Track as a distribution (p50, p95).
Token Efficiency: output_tokens / (total_input_tokens + output_tokens), where total_input_tokens includes system prompts, retrieved context, and retried calls. Low efficiency suggests system prompt bloat or over-retrieval.
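
The three calculations above can be sketched as follows. The model names and prices in PRICES_PER_MILLION are hypothetical placeholders, not real rates, and the Call record is an assumed shape for whatever your logging captures; substitute your provider's current pricing.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical prices in USD per 1M tokens. Replace with your provider's rates.
PRICES_PER_MILLION = {
    "large-model": {"input": 2.50, "output": 10.00},
    "small-model": {"input": 0.15, "output": 0.60},
}


@dataclass
class Call:
    """Token usage logged for one LLM invocation."""
    model: str
    prompt_tokens: int
    completion_tokens: int


def call_cost(call: Call) -> float:
    """Cost in USD of a single invocation, pricing input and output separately."""
    price = PRICES_PER_MILLION[call.model]
    usd = call.prompt_tokens * price["input"] + call.completion_tokens * price["output"]
    return usd / 1_000_000


def journey_cost(calls: List[Call]) -> float:
    """Total cost in USD of every LLM call made to resolve one user query."""
    return sum(call_cost(c) for c in calls)


def token_efficiency(calls: List[Call]) -> float:
    """Output tokens divided by all tokens consumed across the journey."""
    total_in = sum(c.prompt_tokens for c in calls)
    total_out = sum(c.completion_tokens for c in calls)
    total = total_in + total_out
    return total_out / total if total else 0.0
```

Retried calls should be appended to the journey's list like any other call, so that their tokens count toward both the cost and the efficiency denominator.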

Context & Memory Efficiency

For RAG and memory-augmented systems, more context is not always better. There’s a quality curve.

Metric	What It Measures	Why It Matters
Context Size vs Quality Curve	How response quality changes as you increase the number of retrieved chunks (K)	Diminishing returns — going from K=3 to K=10 may add noise without improving accuracy
Memory Size vs Relevance Curve	How memory recall quality degrades as the memory store grows	Older or less relevant memories may pollute the context window

How to Evaluate

Sweep K: Run the same test set with K=1, 3, 5, 10, 20. Plot accuracy (or faithfulness score) vs K.
Find the elbow: Identify the point where adding more chunks stops improving quality.
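
One simple way to automate the sweep and the elbow search is shown below. It is a sketch under stated assumptions: evaluate is a hypothetical function you supply that runs the test set at a given K and returns a single quality score, and the "elbow" is approximated as the smallest K whose score is within a tolerance of the best score observed. Other elbow heuristics are equally valid.

```python
from typing import Callable, Dict, Iterable


def sweep_k(evaluate: Callable[[int], float], ks: Iterable[int]) -> Dict[int, float]:
    """Run the same evaluation once per K and collect the quality scores."""
    return {k: evaluate(k) for k in ks}


def find_elbow(scores_by_k: Dict[int, float], tolerance: float = 0.01) -> int:
    """Smallest K whose score is within `tolerance` of the best score seen."""
    best = max(scores_by_k.values())
    return min(k for k, score in scores_by_k.items() if score >= best - tolerance)
```

A call such as find_elbow(sweep_k(evaluate, [1, 3, 5, 10, 20])) then returns the K to use. The tolerance should reflect the noise in your metric: if repeated runs at the same K vary by a couple of points, a smaller tolerance will chase noise.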

Load Testing

Evaluate system behavior under realistic and peak traffic conditions.

Metric	What It Measures	Why It Matters
Concurrency Limit	Maximum number of simultaneous requests before latency degrades beyond acceptable thresholds	Capacity planning; determines infrastructure scaling requirements
Peak Load Behavior	System behavior at and beyond capacity — does it degrade gracefully or fail catastrophically?	Determines whether the system queues, throttles, or crashes under burst traffic

How to Test

Use load testing tools (Locust, k6, Artillery) to simulate concurrent users.
Ramp from 1 → N concurrent requests. Record latency percentiles (p50, p95, p99) and error rates at each level.
Identify the concurrency at which p95 latency exceeds your SLA — that’s your effective concurrency limit.
Push 20% beyond that limit and observe: does the system queue gracefully, return 429s, or crash?
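
Once the ramp has produced latency samples per concurrency level, the analysis steps above reduce to a few lines. The sketch below assumes you have already exported the samples from your load tool into a dict keyed by concurrency level; the function names and the nearest-rank percentile method are choices made for this example, not output formats of Locust, k6, or Artillery.

```python
import math
from typing import Dict, List, Optional


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def first_breaching_level(
    latencies_by_level: Dict[int, List[float]], sla_p95_s: float
) -> Optional[int]:
    """Lowest concurrency level whose p95 latency exceeds the SLA, if any."""
    for level in sorted(latencies_by_level):
        if percentile(latencies_by_level[level], 95) > sla_p95_s:
            return level
    return None  # the SLA held at every level tested


def overload_level(limit: int, overload_pct: int = 20) -> int:
    """Concurrency to use for the beyond-capacity test (limit + 20% by default)."""
    return math.ceil(limit * (100 + overload_pct) / 100)
```

If first_breaching_level returns None, the ramp did not reach the system's limit and N should be raised before drawing conclusions.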

Tool recommendation: Locust is Python-native, which makes it easy to script custom LLM request patterns. k6 is better suited to high-volume HTTP benchmarks and has built-in dashboards.

Reliability

How does the system behave when things go wrong?

Metric	What It Measures	Why It Matters
Retry Mechanism	Whether failed LLM calls, tool invocations, or retrieval steps are automatically retried with appropriate backoff	Transient failures (rate limits, timeouts) are common; retries prevent unnecessary user-facing errors
Graceful Degradation	System behavior when a component fails — does it fall back to a simpler path or fail entirely?	Users prefer a partial answer over a cryptic error; fallback chains maintain UX under failure conditions

What to Validate

Scenario	Expected Behavior
LLM API returns 429 (rate limited)	Retry with exponential backoff; succeed within 2–3 retries
Primary retrieval service is down	Fall back to cached results or a secondary index
One agent in a multi-agent chain times out	Orchestrator detects the timeout, skips or retries the step, and returns a partial result with a disclaimer
LLM returns unparseable output (malformed JSON)	Retry with a stricter prompt; if it still fails, return a structured error to the user
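
The retry and fallback behaviors in the table can be sketched as two small wrappers. This is an illustration only: TransientError is a hypothetical marker exception that your client code would raise for retryable failures such as a 429 or a timeout, and the delay values and 10% jitter are assumed defaults rather than recommendations from any particular SDK.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (HTTP 429, timeouts)."""


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient failures with exponential backoff and jitter."""
    attempt = 0
    while True:
        try:
            return fn()
        except TransientError:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the error to the caller
            delay = base_delay_s * (2 ** attempt)           # 1s, 2s, 4s, ...
            sleep(delay + random.uniform(0, 0.1 * delay))   # up to 10% jitter
            attempt += 1


def with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Try the primary path; on any exception, use the fallback path instead."""
    try:
        return primary()
    except Exception:  # deliberately broad for this sketch
        return fallback()
```

The two compose naturally, for example with_fallback(lambda: call_with_retry(search_primary), search_cache), where search_primary and search_cache stand in for your own retrieval functions. Injecting sleep as a parameter lets a test suite verify the backoff schedule without actually waiting.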
