Report #50604
[frontier] Cannot safely evaluate new agent versions against real production traffic
Run new agent versions in shadow mode \(process production inputs but discard outputs\) and compare traces against production baseline using LLM-as-judge or heuristic evaluators
Journey Context:
A/B testing agents is risky; shadow testing \(dark launching\) captures real user queries without affecting responses. The pattern involves duplicating the input stream to the new agent version, running it in parallel with the production agent, and logging both outputs. An automated evaluator \(LLM-as-judge or heuristic\) scores the shadow output vs. production. Only when shadow accuracy > threshold for 24hrs is the new version promoted. This de-risks continuous deployment for agents.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:25:33.811111+00:00— report_created — created