Why AI eval pipelines fail in production (and how to build eval loops that don’t lie)
Senior-engineer guide to offline eval pitfalls, production eval loop design, and the logging and failure modes that actually matter for LLM systems.
A year of shipping LLM features tends to produce the same pattern:
Offline evals go green, the dashboard looks impressive, then production users hit a failure that your metrics didn’t predict.
The root cause is almost never “your metric function has a bug.” It’s that your eval pipeline is grading the wrong thing—usually because it’s evaluating a different system than production.
This article is a senior-engineer guide to why that mismatch happens, and how to design production eval loops that stay anchored to real runtime behavior: stage-level replay semantics, evidence logging, and gates that use change-relative baselines and confidence.
Where offline evals lie to you
Offline evals can mislead teams in several ways. The common theme: they evaluate an abstraction (inputs → outputs), but your product is a composition (requests → staged runtime paths → final experiences).
1) They grade “model output” instead of “product behavior”
In production, your response is the result of a pipeline:
prompt policy → retrieval → tool calling → truncation → formatting → post-processing → UI constraints → fallback behavior.
If your offline eval only scores the final text, you will miss regressions where:
- retrieval quality silently degrades, but the judge still likes the prose,
- tool calls fail and the fallback text is “helpful-looking” while being operationally wrong,
- truncation changes meaning, but rubric heuristics don’t penalize the subtle drift.
Senior pattern: evaluate a request trace through the runtime pipeline, not the model in isolation.
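That pattern can be sketched as code. The names below (StageRecord, scoreTrace) are illustrative, not an existing API; the point is that the verdict carries per-stage failures alongside the text score:

```typescript
// Illustrative sketch: score a request trace stage-by-stage, not just final text.
// StageRecord / scoreTrace are hypothetical names, not a real API.
type StageRecord = {
  stage: "retrieval" | "tool" | "truncation" | "generation" | "postprocess";
  ok: boolean;      // did this stage behave within its contract?
  note?: string;    // e.g. "fallback text used", "context truncated"
};

type TraceVerdict = {
  finalTextScore: number;   // what a text-only eval would report
  stageFailures: string[];  // what a pipeline-aware eval also reports
};

function scoreTrace(finalTextScore: number, stages: StageRecord[]): TraceVerdict {
  // A polished final answer does not excuse a broken stage upstream.
  const stageFailures = stages
    .filter((s) => !s.ok)
    .map((s) => `${s.stage}: ${s.note ?? "failed"}`);
  return { finalTextScore, stageFailures };
}

// Example: the judge likes the prose (0.9), but the tool stage fell back.
const verdict = scoreTrace(0.9, [
  { stage: "retrieval", ok: true },
  { stage: "tool", ok: false, note: "timeout, fallback text used" },
]);
```

A text-only eval would report 0.9 and move on; the pipeline-aware verdict surfaces the tool fallback as its own failure category.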
2) They assume the world is stationary
Even if you curate your dataset well, offline data freezes a snapshot of:
- user behavior and prompts,
- adversarial behavior and edge-case frequencies,
- session length and interaction patterns,
- routing decisions in your product.
Production is the moving world.
3) They optimize proxies that don’t map to user success
Rubrics like “helpful,” “factual,” “safe,” “format-correct” are necessary but insufficient because users experience success as operational outcomes, e.g.:
- task completion with no handoff,
- correct tool-grounding,
- stable latency and cost under load,
- consistent refusal reasons and policy alignment.
If your offline objective differs from those operational definitions, you can ship “better scores” that correlate poorly with “fewer tickets.”
4) They avoid nondeterminism and failure-path probability mass
LLM systems fail through paths your offline harness often never exercises:
- retries and temperature variance,
- tool routing edges,
- truncation boundaries,
- “nearly-threshold” retrieval confidence,
- guardrail conflict resolution.
Offline harnesses often run a happy-path, deterministic subset (or a smaller context), so your evaluation underestimates the real-world failure probability.
5) They treat imperfect labels as an oracle
Ground truth labels—benchmarks or internal annotation—contain errors and evolving definitions.
Treat offline labels as evidence that can be stale or biased. The fix is not “throw away offline evals,” it’s to validate against production traces with stage-level attribution.
The golden set trap (especially with RAG)
A common production mismatch:
- teams create a “golden set,”
- freeze it,
- treat it like an oracle.
Users don’t ask questions that match the golden set distribution. They ask:
- messy and underspecified requests,
- contradictory follow-ups,
- adversarial prompts,
- long multi-turn sessions,
- time-sensitive questions with index freshness effects.
With RAG, the dependency chain multiplies the ways offline can miss reality:
- retrieval depends on indexing freshness, chunking, embeddings, query rewriting,
- generation depends on what retrieval returned and how context was injected,
- policy depends on how context and request were assembled (and what was truncated).
So if offline evals only score “final response quality,” they’ll often miss where the failure came from and which segment it affects.
What “production eval” actually means
Production eval is not “run unit tests on prod.” It’s a continuously updated feedback system that answers:
- User/task success: did the request complete in the way users interpret as successful?
- Stage-level failure attribution: where did the degradation originate (retrieval, tool, truncation, guardrails, post-processing)?
- Change impact: when you deploy new prompts/model/tools/rerankers, did failure probability move?
Crucially, the continuous part must be tied back into evaluation and gating—not just dashboards.
A production eval loop that doesn’t lie
The core principle: collect traces from production, replay/evaluate them deterministically where possible, score evidence with the same runtime context, then gate based on segmented risk.
Plane 1: Trace capture (per request, per step)
For each request, capture stage visibility and the configs that shaped behavior. Minimum set of evidence:
- Identifiers
- trace_id
- session_id_hash (privacy-safe)
- tenant/user hash only if needed for stratification (or use coarse segmentation)
- Runtime config (versioned)
- model.version
- system_prompt_hash (or system prompt version ID)
- decoding params that affect behavior: temperature, top_p, max_tokens (and retry policy ID)
- tool schema version
- retrieval config version + embedding model version
- RAG details (if enabled)
- query rewrite output hash (store text only for sampled traces)
- doc IDs (or stable doc hashes) + retrieval scores distribution bins
- whether retrieval was empty/low confidence
- whether reranking was applied and with which config ID
- Tool calling details
- chosen tool name + schema version
- tool arguments hash (and redacted args if policy allows)
- tool response status (success/timeout/error)
- tool response snapshot strategy (details below)
- Context boundary
- token counts (input, available completion budget)
- truncation flag
- dropped message/chunk hashes or counts
- Guardrails/policy
- rule IDs / classifier versions
- decision: allow/block/rewrite
- reason codes
- Outputs
- raw model output hash
- final response hash (post-processing)
- parse/format validity + structured error codes
- fallbackUsed flag
Snapshot strategy (important):
- Do not store full tool payloads for every request.
- Store enough to replay scoring without re-executing side-effecting actions.
- The more side-effecting your tool is, the more you must rely on snapshots/hashes instead of re-calls.
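A minimal sketch of that snapshot strategy, assuming Node’s built-in crypto module; hashPayload and SnapshotStore are hypothetical names, not a library API:

```typescript
import { createHash } from "node:crypto";

// Stable-ish payload hash for replay lookups. Real systems need canonical
// JSON key ordering before hashing; this sketch skips that detail.
function hashPayload(payload: unknown): string {
  return createHash("sha256").update(JSON.stringify(payload)).digest("hex");
}

type ToolSnapshot = {
  status: "success" | "timeout" | "error";
  essentials: Record<string, unknown>; // redacted fields needed for scoring only
};

class SnapshotStore {
  private byHash = new Map<string, ToolSnapshot>();

  // Always store the hash; store the full (redacted) snapshot only for
  // sampled traces, per the strategy above.
  record(payload: unknown, snapshot: ToolSnapshot, sampled: boolean): string {
    const h = hashPayload(payload);
    if (sampled) this.byHash.set(h, snapshot);
    return h;
  }

  // Replay scoring reads the snapshot instead of re-calling the tool.
  replay(hash: string): ToolSnapshot | undefined {
    return this.byHash.get(hash);
  }
}
```

The important property: replay never touches the live tool, so re-evaluation cannot trigger side effects.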
Plane 2: Online weak signals (cheap, immediate)
Before running expensive judges, compute cheap operational signals per trace:
- latency by stage (P50/P99 and stage breakdown)
- token usage and cost proxies
- tool error/timeout rates
- JSON/format validity rates
- guardrail block rates and reason codes
- retrieval health: empty rate, top-1 score bin distribution, truncation rate
These are not correctness metrics, but they detect “incidents-in-the-making.”
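These signals are cheap to batch-compute per stage. A sketch using simple nearest-rank percentiles (production systems typically use streaming sketches like t-digest instead); percentile and stageSignals are illustrative names:

```typescript
// Nearest-rank percentile over a finite sample; fine for batch jobs,
// not for high-cardinality streaming.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

type StageSignals = { p50: number; p99: number; errorRate: number };

// Cheap per-stage weak signals: latency percentiles plus error rate.
function stageSignals(latenciesMs: number[], errors: number): StageSignals {
  return {
    p50: percentile(latenciesMs, 50),
    p99: percentile(latenciesMs, 99),
    errorRate: errors / latenciesMs.length,
  };
}
```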
Plane 3: Offline scoring anchored to production traces
Run batch scoring on sampled production traces, where evaluators receive the same evidence that runtime used:
- user input
- retrieved contexts (doc IDs/hashes + optionally sampled text)
- tool response snapshots/hashes
- final answer (or response structure)
- stage flags (truncation, fallback)
Evaluator types
- Deterministic checks: JSON validity, schema compliance, formatting invariants
- Rubric-based judges: LLM-as-judge with strict rubric + versioning
- Human review: for high-risk segments, judge uncertainty, or recurring failure clusters
Segmentation (non-negotiable)
Compute scores per segment, not just overall averages. Examples:
- RAG-on vs RAG-off
- tool-using vs pure-chat
- truncation=true vs truncation=false
- retrieval confidence bins (top-1 quantiles)
- session length bucket
- guardrail decision category (refusal vs answer, or allow vs rewrite vs block)
Overall averages hide regressions that only occur in specific runtime paths.
Replay semantics: avoid side effects when you “re-evaluate prod”
A major failure mode: re-running over prod traces but accidentally changing what external systems return (or causing real side effects).
Correct approach: snapshot or mock tool outputs
When you capture a trace, store enough to replay scoring safely:
- tool response snapshot payload (redacted) or
- structured essentials needed for evaluation (e.g., status, key fields, error reason) or
- a hash + deterministic lookup table that maps hash → redacted snapshot
Also store tool metadata:
- status code / success/error
- timeout indicator
- response latency (for diagnostics)
Then offline evaluators score against the captured evidence rather than by calling the live tool again.
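What “score against captured evidence” looks like in practice: the evaluator’s input type contains only captured fields, so it structurally cannot call a live tool. Evidence and scoreFromEvidence are illustrative names:

```typescript
// Only captured evidence crosses this boundary; no tool client in sight.
type Evidence = {
  toolStatus: "success" | "timeout" | "error";
  toolResponseHash?: string;
  truncated: boolean;
  finalAnswerHash: string;
};

type ReplayScore = { grounded: boolean; reasons: string[] };

function scoreFromEvidence(ev: Evidence): ReplayScore {
  const reasons: string[] = [];
  if (ev.toolStatus !== "success") reasons.push(`tool_${ev.toolStatus}`);
  if (!ev.toolResponseHash) reasons.push("no_tool_evidence");
  if (ev.truncated) reasons.push("context_truncated");
  // "Grounded" here means: the tool call succeeded and left evidence behind.
  return { grounded: ev.toolStatus === "success" && !!ev.toolResponseHash, reasons };
}
```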
If you must re-run tools
Do it in a sandbox with idempotency. Treat “sandbox re-run scoring” as separate from “evidence replay scoring,” and track the difference explicitly so you know what you’re trusting.
What to log (so incidents are debuggable)
When something breaks, you need five answers fast:
- What changed?
- Which path ran?
- Was context truncated?
- Did tools fail or return bad evidence?
- What did the confidence proxies predict?
Log these query-ready fields:
Change attribution
- model.version
- system_prompt_version (or prompt hash)
- retrievalConfigHash
- embeddingModelVersion
- toolSchemaVersion
- judgeRubricVersion (for offline scoring runs)
Path and truncation
- toolRouteChosen (or “tool called?”)
- stopReason / stop mechanism
- truncated boolean
- dropped message/chunk hashes or counts
- history length, token counts
Evidence fields for scoring
- retrieval: empty rate, top-1 doc IDs/hashes, score bin
- tool: arguments hash, tool response hash, tool status
Confidence proxies
- parse/format validity and structured error codes
- guardrail reason codes
- judge output metadata (if supported), plus judge rubric ID
Failure modes that matter in real LLM systems
Below is a production-relevant taxonomy. For each, the “gate-worthy” logs differ.
1) Retrieval failure (RAG-specific)
Symptoms
- answers sound confident but are wrong,
- grounding is missing but rubrics still pass,
- failures cluster around certain intents or low-confidence retrieval.
Likely causes
- embedding drift or query rewrite regressions,
- chunking changes,
- index freshness issues,
- retrieval thresholds too strict or top-k too small.
Log & measure
- emptyRetrievalRate by segment
- retrieval top-1 score bin distribution shifts
- fraction of traces where retrieved chunks were not referenced in final response (if you track attribution)
Gates
- segment-level empty rate change (use change-relative baselines)
- segment-level top-1 bin distribution shift
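One way to implement the bin-distribution-shift gate is total variation distance between baseline and current bin counts. The 0.1 threshold below is a placeholder to tune per segment, not a recommendation:

```typescript
// Total variation distance between two histograms over the same bins.
// Inputs are raw counts; both are normalized to probability distributions.
function totalVariation(baseline: number[], current: number[]): number {
  const norm = (xs: number[]) => {
    const s = xs.reduce((a, b) => a + b, 0);
    return xs.map((x) => x / s);
  };
  const p = norm(baseline);
  const q = norm(current);
  return 0.5 * p.reduce((acc, pi, i) => acc + Math.abs(pi - q[i]), 0);
}

// Gate fires when the top-1 score bin distribution moved more than the threshold.
function binShiftGate(baseline: number[], current: number[], threshold = 0.1): boolean {
  return totalVariation(baseline, current) > threshold;
}
```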
2) Context boundary degradation
Symptoms
- long sessions fail while short sessions are fine,
- “hallucinations” correlate with token budgets,
- formatting breaks only after long history.
Likely causes
- truncation without stage awareness,
- oversized system prompts,
- tool outputs injected verbatim without summarization/compression.
Log & measure
- truncated flag
- input token count and remaining completion budget
- history length and dropped item counts
Gates
- quality drop correlated with truncated=true
- format/parse validity drops at truncation boundary
3) Tool invocation & schema issues
Symptoms
- JSON parse failures,
- wrong tool chosen,
- agent loops / stops prematurely,
- policy blocks due to tool output mismatch.
Likely causes
- prompt format changes breaking routing,
- schema mismatch after tool updates,
- timeouts and stop-condition bugs.
Log & measure
- tool routing input signatures (hashes)
- tool arguments hash
- toolStatus success/timeout/error
- tool response sizes and timeouts
Gates
- tool timeout rate spikes
- invalid tool args rate spikes
- loop detection triggers rate spikes
4) Evaluation drift (your evals get stale)
Symptoms
- offline passes remain green while production decays.
Likely causes
- dataset stops matching traffic mix,
- rubric no longer correlates with user success,
- evidence passed to judges differs from runtime evidence.
Fixes
- sample recent production traces into eval batches
- version rubric/judge prompts
- evaluate the same evidence the runtime used (doc IDs/tool snapshots/truncation flags)
5) Cost-driven collapse under load
Symptoms
- quality drops during traffic spikes,
- outputs become shorter or more refusals,
- fallbacks increase.
Likely causes
- rate limiting and retries,
- token budget enforcement under load,
- autoscaling stalls causing earlier truncation.
Log & measure
- stage latency
- rate-limit indicators
- retry counts
- final truncation rate under load
Gates
- quality metrics conditioned on “budget constrained” traces
- fallback rate correlated with load
Example incident (fully grounded, no magical thinking)
Consider a “support order status” agent that should call a tool to fetch the order state.
Offline setup (common mistake)
- The eval dataset includes examples where the tool is never invoked (or tool outputs are assumed).
- The judge rubric scores the final text for “helpfulness” and “correctness,” but it’s not conditioned on whether the answer is grounded in tool evidence.
Production reality
- After a routing change, the agent sometimes decides “retrieval looks okay” and skips the tool.
- The final text is polished, but it is no longer tool-grounded.
How traces identify it
- Segment on toolUsed=false
- Score “tool-grounded correctness” only for segments where the tool is required
- You should see failures clustered in the tool-skipped segment, not globally
This is exactly why stage-level evidence and segmentation beat overall averages.
Practical implementation: Monday-morning steps
Step 0: Define production success (two layers)
Pick at least two layers:
- Task success: bounded proxies or (where feasible) human labels
- Operational success: latency/cost constraints, no tool errors, parse/format correctness, correct fallback behavior
Be explicit about what a proxy means. A “thumbs down” can reflect several distinct failure types; define which stage each one correlates with.
Step 1: Create deterministic trace objects with fingerprints
```typescript
// Minimal shape: enough to reproduce scoring semantics safely.
export type Trace = {
  traceId: string;
  req: {
    inputTextHash: string; // keep raw only in sampled traces
    sessionIdHash: string;
    channel?: "web" | "mobile";
    locale?: string;
  };
  config: {
    model: { name: string; version: string };
    systemPromptVersion: string;
    temperature: number;
    maxTokens: number;
    toolSchemaVersion?: string;
    retrievalConfigVersion?: string;
    embeddingModelVersion?: string;
    judgeRubricVersion?: string; // for offline runs
  };
  evidence: {
    retrieval?: {
      empty: boolean;
      topKDocIds: string[]; // stable doc IDs or hashed IDs
      topKScoreBins: number[]; // quantized bins (privacy + stability)
      usedInFinalAnswer: boolean; // computed if you do citation/attribution
    };
    tool?: Array<{
      toolName: string;
      schemaVersion: string;
      argsHash: string; // redacted
      status: "success" | "timeout" | "error";
      responseHash?: string;
      // Store responseSnapshot only for sampled traces/high-risk tools.
      // snapshot?: { ...redacted essential fields... }
    }>;
  };
  runtime: {
    truncated: boolean;
    inputTokenCount: number;
    completionTokenCount: number;
    latencyMs: number;
    guardrails: Array<{ ruleId: string; decision: "allow" | "block" | "rewrite" }>;
    fallbackUsed: boolean;
  };
  outputs: {
    rawModelTextHash: string;
    finalAnswerHash: string;
    formatValid: boolean;
    formatErrorCode?: string; // e.g., "JSON_SCHEMA_VIOLATION"
  };
};
```

Step 2: Generate segment keys (and keep them stable)
Compute stable keys from versioned config + runtime path signals:
- rag_on/off
- tool_called (and tool schema version)
- truncation bucket
- retrieval confidence bin bucket
- model version
This makes gating queries tractable and prevents accidental “mixing apples and oranges.”
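A sketch of stable segment key generation; the field names and key format are illustrative:

```typescript
// Inputs come from versioned config plus runtime path signals.
type SegmentInputs = {
  ragOn: boolean;
  toolCalled: boolean;
  toolSchemaVersion?: string;
  truncated: boolean;
  retrievalConfBin?: number; // quantized top-1 score bin
  modelVersion: string;
};

function segmentKey(s: SegmentInputs): string {
  // Fixed field order keeps keys stable across releases, so baseline and
  // current windows line up on the same segments.
  return [
    `rag=${s.ragOn ? "on" : "off"}`,
    `tool=${s.toolCalled ? s.toolSchemaVersion ?? "unknown" : "none"}`,
    `trunc=${s.truncated}`,
    `rbin=${s.retrievalConfBin ?? "na"}`,
    `model=${s.modelVersion}`,
  ].join("|");
}
```

The fixed field order is the point: if keys are built ad hoc per query, baselines silently stop matching.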
Step 3: Sample production traces with failure-mode aware weighting
Uniform sampling hides problems.
Start with weighted sampling that increases coverage for segments that historically fail:
- high latency
- truncated=true
- tool timeouts/errors
- emptyRetrieval or low-confidence retrieval bins
- recent negative user feedback events (if available)
- judge uncertainty (if your judge reports confidence)
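The weighting above can be sketched as multiplicative risk factors; the multipliers are illustrative starting points, not tuned values:

```typescript
// Risk signals observed on a trace at capture time.
type SampleSignals = {
  highLatency: boolean;
  truncated: boolean;
  toolError: boolean;
  emptyRetrieval: boolean;
  negativeFeedback: boolean;
};

// Multiplicative weights: a trace with several risk signals becomes
// much more likely to enter the eval batch than a clean one.
function sampleWeight(s: SampleSignals, base = 1): number {
  let w = base;
  if (s.highLatency) w *= 2;
  if (s.truncated) w *= 3;
  if (s.toolError) w *= 4;
  if (s.emptyRetrieval) w *= 3;
  if (s.negativeFeedback) w *= 5;
  return w;
}

// Convert a weight into an inclusion probability given a base sampling rate.
function inclusionProb(weight: number, baseRate = 0.01): number {
  return Math.min(1, weight * baseRate);
}
```

If you later compute segment metrics from this sample, remember to reweight (inverse inclusion probability), or the oversampled failure segments will bias your aggregates.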
Step 4: Run batch scoring with strict evidence alignment
When scoring with a judge:
- version the judge prompt/rubric
- pass the same evidence the runtime used
- record judge failures explicitly (timeouts, invalid judge output, rubric mismatch)
Step 5: Gate deploys using confidence intervals + hysteresis
Gating must be change-relative and segment-aware. Use baselines from the previous release (or rolling window) and compute uncertainty.
Important: below is illustrative pseudo-logic. Replace placeholders with your actual baseline computation and CI method.
```
-- Illustrative pseudo-query (NOT production SQL):
-- Segment: non-truncated traces on toolRoute=tool_v1
-- Metric: invalid JSON format rate among sampled traces
-- Baseline: last release distribution for the same segment
metric invalidFormatRate = invalid_format_count / total_count;

gate invalid_format_deploy_block
  if metric.invalidFormatRate > baseline.invalidFormatRate + ci95.invalidFormatRateWidth
  and metric.total_count >= min_samples
  for consecutive_runs >= 2;
```

The key idea:
- min_samples prevents gates from reacting to noise
- confidence interval width prevents “one bad day” from blocking deploys
- consecutive_runs/hysteresis prevents flapping
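The pseudo-logic can be made runnable. This sketch uses a normal-approximation confidence interval for the difference of two proportions plus consecutive-run hysteresis; DeployGate and the z = 1.96 default are illustrative choices, not a prescribed method:

```typescript
type GateInput = {
  baselineFailures: number; baselineTotal: number;
  currentFailures: number; currentTotal: number;
};

// True if the current failure rate exceeds baseline by more than the
// z * SE band (normal approximation for a difference of two proportions).
function regressionExceedsCI(g: GateInput, z = 1.96): boolean {
  const p1 = g.baselineFailures / g.baselineTotal;
  const p2 = g.currentFailures / g.currentTotal;
  const se = Math.sqrt(
    (p1 * (1 - p1)) / g.baselineTotal + (p2 * (1 - p2)) / g.currentTotal
  );
  return p2 - p1 > z * se;
}

class DeployGate {
  private consecutive = 0;
  constructor(private minSamples: number, private requiredRuns: number) {}

  shouldBlock(g: GateInput): boolean {
    const regressed =
      g.currentTotal >= this.minSamples && regressionExceedsCI(g);
    // Hysteresis: block only after N consecutive regressed evaluations.
    this.consecutive = regressed ? this.consecutive + 1 : 0;
    return this.consecutive >= this.requiredRuns;
  }
}
```

With requiredRuns = 2, a single noisy batch resets nothing permanent: the counter clears on the next clean run, which is the anti-flapping behavior described above.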
Step 6: Add regression detection tied to what changed
Track deltas between deploys:
- model.version change
- systemPromptVersion change
- retrieval/embedding config version change
- tool schema version change
- truncation behavior change (if any config changed token budgets)
Then compute which failure tags increased by stage/segment. You won’t get perfect causality, but you’ll get actionable hypotheses quickly.
A concrete end-to-end gate example (with labeled synthetic numbers)
Scenario
You want to block deployments when format validity regresses for traces:
- toolSchemaVersion = "tool_v1"
- truncated = false
- rag_on (retrieval enabled)
Data slices (illustrative / synthetic)
Assume your evaluator computed, for one release window:
- baseline invalid format count = 18 / 9,000
- current invalid format count = 41 / 9,000
These are example numbers to demonstrate the mechanics, not actual operational values.
Gate logic (illustrative)
- compute invalid format rates:
- baseline rate = 0.2%
- current rate ≈ 0.456%
- compute a confidence interval (binomial/bootstrapped) for the difference
- block only if:
- difference exceeds the confidence interval threshold, and
- n >= min_samples, and
- regression repeats for N consecutive deployments
Output you should store for auditability
- segment key
- baseline window definition (timestamps/versions)
- sample sizes
- computed CI parameters
- metric counts
- deploy decision with rubric version IDs
This “why did we block?” audit trail is as important as the decision.
Further reading
- Microsoft, Taxonomy of Failure Mode in Agentic AI Systems
  https://cdn-dynmedia-1.microsoft.com/is/content/microsoftcorp/microsoft/final/en-us/microsoft-brand/documents/Taxonomy-of-Failure-Mode-in-Agentic-AI-Systems-Whitepaper.pdf
- Langfuse, LLM Evaluation 101: Best Practices, Challenges & Proven Techniques
  https://langfuse.com/blog/2025-03-04-llm-evaluation-101-best-practices-and-challenges
- Datadog, Building an LLM evaluation framework: best practices
  https://www.datadoghq.com/blog/llm-evaluation-framework-best-practices/
- Comet, LLM Evaluation: The Ultimate Guide to Metrics, Methods & Best Practices
  https://www.comet.com/site/blog/llm-evaluation-guide/
If you only do three things next
- Bind evaluation to production traces with stage-level metadata and versioned configs.
- Gate deploys on segmented, change-relative operational + rubric signals (with confidence + hysteresis).
- Log and replay evidence (retrieval/tool/context/truncation) so failures become debuggable categories, not vague complaints.
Offline evals aren’t useless—they’re just not the authority. In production, truth lives in the trace.