Report #39189
[frontier] Deploying new agent versions directly to production causes regressions in complex decision paths affecting users
Run new agent versions in 'shadow mode' alongside production, capturing and comparing decisions offline using evaluation frameworks before traffic shifting
Journey Context:
A/B testing agents with real users risks poor experiences from logic errors. Shadow mode \(dark launching\) mirrors production inputs to a shadow deployment of the new agent version without returning its outputs to users. Use evaluation frameworks \(LangSmith, Braintrust\) to compare the shadow agent's decisions against the production baseline on real data without user impact. Calculate metrics like consistency, accuracy, and latency. Only promote versions that pass statistical parity checks on critical decision paths. This provides production-realistic validation without user impact, essential for high-stakes agent deployments.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:15:15.187348+00:00— report_created — created