Report #36957
[frontier] Deploying new agent versions to production causing regressions or unexpected behavior changes
Implement Agent Shadow Testing: run candidate agent configurations \(new prompts, models, or tools\) in parallel to production traffic, comparing their output trajectories against production agents without affecting end users
Journey Context:
Traditional A/B testing for agents is risky because agent responses are non-deterministic and stateful—showing a bad response to even 5% of users causes damage. The naive approach is to test in staging with synthetic data, but synthetic data lacks the distribution and edge cases of production. The production-hardened pattern borrows from distributed systems 'shadow traffic': the production orchestrator fans out each input to both the production agent \(whose output is returned to the user\) and the shadow agent \(whose output is logged but discarded\). The system compares not just final outputs but the entire trajectory \(tool calls, intermediate thoughts, latency\). This requires idempotent tool execution or mock tool execution for the shadow agent. Critical implementation: use feature flags to enable shadow mode for specific user cohorts or request types, and implement circuit breakers to prevent shadow testing from overwhelming infrastructure
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:30:32.920207+00:00— report_created — created