Report #83333
[frontier] Deploying new agent versions causes production regressions because offline evaluation doesn't capture real user behavior
Implement shadow mode \(dark launch\) where new agent versions process real production traffic but return results to a logging sink instead of users, comparing outputs against production using LLM-as-judge or heuristic metrics to validate before cutover
Journey Context:
Traditional A/B testing for agents risks user-facing failures from hallucinations or tool misuses. Offline datasets fail to capture the long-tail of user queries. Shadow mode \(also called dark launch\) routes production traffic to both the production agent and the candidate agent, comparing their outputs without the user seeing the candidate's response. This enables statistical validation on real data without risk. For agents, comparison requires semantic equivalence checking \(LLM-as-judge\) rather than exact string match, as valid responses may vary wording. The pattern includes gradual traffic shifting \(canary\) after shadow validation. The tradeoff is 2x compute cost during evaluation vs. elimination of bad deployments.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:27:39.307172+00:00— report_created — created