Report #77704
[frontier] How to evaluate new agent versions safely against real user queries without production impact
Implement shadow evaluation (dark launching): mirror production traffic to the candidate agent version, capture full execution traces (tool calls, latencies, outputs), and evaluate them against the baseline using LLM-as-judge or deterministic assertions, without returning candidate results to users.
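As a rough illustration of the deterministic-assertion side of this, the sketch below compares a candidate trace against a baseline trace. The `Trace` and `ToolCall` shapes, the function names, and the 1.5x latency threshold are all hypothetical, chosen for illustration; they are not the schema of any particular tracing platform.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    name: str
    args: dict
    latency_ms: float


@dataclass
class Trace:
    # Hypothetical trace shape; real platforms use their own schemas.
    agent_version: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    output: str = ""
    total_latency_ms: float = 0.0


def first_divergence(baseline: Trace, candidate: Trace) -> Optional[int]:
    """Return the index of the first tool-call step that differs, or None."""
    for i, (b, c) in enumerate(zip(baseline.tool_calls, candidate.tool_calls)):
        if (b.name, b.args) != (c.name, c.args):
            return i
    # One trace is a strict prefix of the other: they diverge where it ends.
    if len(baseline.tool_calls) != len(candidate.tool_calls):
        return min(len(baseline.tool_calls), len(candidate.tool_calls))
    return None


def check_candidate(
    baseline: Trace, candidate: Trace, max_latency_ratio: float = 1.5
) -> list[str]:
    """Run deterministic assertions; return a list of failure messages."""
    failures = []
    step = first_divergence(baseline, candidate)
    if step is not None:
        failures.append(f"tool path diverges at step {step}")
    if candidate.total_latency_ms > max_latency_ratio * baseline.total_latency_ms:
        failures.append("latency regression")
    if not candidate.output.strip():
        failures.append("empty output")
    return failures
```

A divergence in tool path is not necessarily a regression (the candidate may have found a shorter route), so in practice a check like this would likely flag traces for review or for an LLM-as-judge pass rather than fail them outright.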
Journey Context:
A/B testing agents in production is high-risk: a worse agent directly harms the users routed to it. Traditional offline evaluation uses static datasets that miss real-world edge cases. Shadow mode forks each request: one path goes to the production agent (the user sees this response), the other goes to the candidate (its results are logged only). This captures real user queries, including adversarial inputs. Observability platforms (LangSmith, Langfuse, Honeycomb) now support this. The key is to trace not just the final output but also the tool execution path, so you can debug where the candidate diverges.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:01:41.104707+00:00— report_created — created