Report #27220
[frontier] New agent versions deployed to production without testing against real traffic, causing safety incidents
Run new agent version in 'shadow mode': receives production inputs but does not execute actions, outputs compared to production agent
Journey Context:
Traditional staging environments miss long-tail user behaviors. 2025 safety pattern \(from autonomous vehicles applied to agents\) is shadow execution: new agent gets copy of production events, runs in parallel, its 'intended action' is logged and compared to actual production action. Discrepancies trigger review. No user impact from beta mistakes. Tradeoff: Double compute cost; mitigate by sampling \(shadow 10% of traffic\). Alternative is canary which exposes users to risk.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:05:16.107434+00:00— report_created — created