Report #54252
[frontier] Cannot safely evaluate new agent versions in production without risking user experience degradation
Deploy shadow mode (dark launch): candidate agent versions execute on production traffic but their outputs are discarded, and their trajectories and outcomes are compared against the baseline with statistical tests before promotion
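A minimal Python sketch of the shadow-mode request path, under stated assumptions: the `MockTools` class, the `Trajectory` record, the agent signature, and `handle_request` are all hypothetical names introduced for illustration, not part of any specific framework. For simplicity the candidate runs after the baseline rather than concurrently.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    """What the shadow run leaves behind for later comparison."""
    tool_calls: List[str] = field(default_factory=list)
    output: str = ""


class MockTools:
    """Hypothetical sandboxed tool layer: records calls, performs no side effects."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def call(self, name: str, **kwargs) -> str:
        self.calls.append(name)
        return f"mock:{name}"


# Assumed agent signature: takes the user input and a tool layer, returns text.
Agent = Callable[[str, MockTools], str]


def handle_request(
    user_input: str,
    baseline: Agent,
    candidate: Agent,
    live_tools: MockTools,
    shadow_log: List[Trajectory],
) -> str:
    # The baseline serves the user with the real (live) tool layer.
    live_output = baseline(user_input, live_tools)

    # The candidate sees the same input but only sandboxed tools.
    sandbox = MockTools()
    try:
        shadow_output = candidate(user_input, sandbox)
    except Exception as exc:
        # A failing candidate must never affect the user-facing response.
        shadow_output = f"error:{type(exc).__name__}"

    # The candidate's output is logged for comparison, never returned.
    shadow_log.append(Trajectory(list(sandbox.calls), shadow_output))
    return live_output
```

In a real deployment `live_tools` would be the production tool layer with the same interface; here it reuses `MockTools` only so the sketch stays self-contained.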
Journey Context:
A/B testing agents is risky: a bad agent version can create irreversible bad experiences (e.g., deleting user data via tool calls). The emerging pattern adapts shadow deployment from MLOps to agents: the production agent (baseline) handles the request normally, while the candidate agent processes the same input in a 'shadow' sandbox (isolated tools, mock side effects). Their trajectories (tool calls, latency, token usage) are then compared. Statistical tests (e.g., the Mann-Whitney U test, a non-parametric test suited to trajectory metrics that are not normally distributed) determine whether the candidate is safe to promote. This requires careful handling of non-determinism, either by pinning temperature=0 for the shadow run or by drawing multiple samples per input. Reportedly used by Honeycomb and Stripe for LLM features.
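The comparison step can be illustrated with a small standard-library sketch of the Mann-Whitney U test, assuming one numeric metric per request (for example latency or token count). The function names are hypothetical, and the p-value uses the normal approximation with tie and continuity corrections, which is only reasonable for moderately large samples; `scipy.stats.mannwhitneyu` is the usual choice when SciPy is available.

```python
import math
from typing import List, Sequence, Tuple


def _average_ranks(values: Sequence[float]) -> Tuple[List[float], float]:
    """Return 1-based ranks (ties get the average rank) and the tie term sum(t^3 - t)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_term = 0.0
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    return ranks, tie_term


def mann_whitney_u(baseline: Sequence[float], candidate: Sequence[float]) -> Tuple[float, float]:
    """Return (U for the baseline sample, two-sided p-value via normal approximation)."""
    n1, n2 = len(baseline), len(candidate)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    ranks, tie_term = _average_ranks(list(baseline) + list(candidate))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0  # every observation identical: no evidence of a difference
    z = max((abs(u1 - mean_u) - 0.5) / math.sqrt(var_u), 0.0)  # continuity correction
    return u1, min(math.erfc(z / math.sqrt(2)), 1.0)


# Illustrative, made-up latency samples in seconds (not real measurements).
baseline_latency = [1.0, 2.0, 3.0, 4.0, 5.0]
candidate_latency = [6.0, 7.0, 8.0, 9.0, 10.0]
u_stat, p_value = mann_whitney_u(baseline_latency, candidate_latency)
```

A small p-value here only says the two metric distributions differ; whether the difference is a regression, and which threshold gates promotion, remains a policy decision outside this sketch.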
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:33:40.248761+00:00— report_created — created