Report #60706
[frontier] Deploying new agent versions causes production regressions; difficult to test agent behavior offline due to environmental complexity
Run each new agent version in shadow mode: duplicate production traffic to it, diff its outputs against the production agent's, and never return its results to users. Validate for several days before switching traffic.
Journey Context:
Traditional integration testing breaks down for agents because tool side effects \(sending emails, updating databases\) are hard to mock realistically, and LLM nondeterminism makes assertions flaky. Shadow mode \(dark launching\) sends real production inputs to the candidate agent in parallel. The production agent handles the request and is the only one whose response reaches the user; the shadow agent's response is logged and compared against it \(semantic diff or exact match, depending on the tool\). The shadow agent's side-effecting tool calls must be recorded rather than executed, otherwise every email or database write would happen twice. This catches regressions in tool selection, argument formatting, or reasoning chains without user impact. Tradeoff: it roughly doubles infrastructure cost during validation, but greatly reduces the risk of a bad deploy.
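A minimal sketch of the request path described above, under several assumptions that are not in the report: each agent is modeled as a function that proposes a single tool call, and `ToolCall`, `diff_calls`, `handle_request`, `execute`, and `shadow_log` are hypothetical names chosen for illustration. The shadow agent runs sequentially here for brevity (a real setup would likely run it in parallel or asynchronously), and the comparison is exact match only; a semantic diff would need a separate comparator.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class ToolCall:
    """Hypothetical record of the tool an agent wants to invoke."""
    tool: str
    args: Dict[str, Any]


# Assumption: an agent maps a request string to one proposed tool call.
Agent = Callable[[str], ToolCall]


def diff_calls(prod: ToolCall, shadow: ToolCall) -> List[str]:
    """Exact-match diff of two proposed tool calls; empty list means identical."""
    if prod.tool != shadow.tool:
        return [f"tool: {prod.tool!r} != {shadow.tool!r}"]
    diffs = []
    for key in sorted(set(prod.args) | set(shadow.args)):
        if prod.args.get(key) != shadow.args.get(key):
            diffs.append(
                f"args.{key}: {prod.args.get(key)!r} != {shadow.args.get(key)!r}"
            )
    return diffs


def handle_request(
    request: str,
    prod_agent: Agent,
    shadow_agent: Agent,
    execute: Callable[[ToolCall], Any],
    shadow_log: List[Dict[str, Any]],
) -> Any:
    """Serve the request from production; log the shadow agent's diff only."""
    prod_call = prod_agent(request)
    result = execute(prod_call)  # only the production call has side effects

    try:
        # The shadow call is proposed and compared, but never executed.
        shadow_call = shadow_agent(request)
        diffs = diff_calls(prod_call, shadow_call)
    except Exception as exc:  # a shadow failure must not affect the user
        diffs = [f"shadow error: {exc!r}"]
    shadow_log.append({"request": request, "diffs": diffs})

    return result  # the user only ever sees the production result
```

Entries in `shadow_log` with a non-empty `diffs` list are the candidates for review during the validation window. Wrapping the shadow call in `try`/`except` is a deliberate choice in this sketch, so that a crash in the candidate version is recorded as a regression instead of surfacing to users.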
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:22:49.952207+00:00— report_created — created