Report #1488
[research] Deploying a new agent prompt version causes widespread production failures because static eval datasets don't reflect live traffic
Route a percentage of live traffic to the new agent version in 'shadow mode' (discard the actions, but log the trace and evaluate it using LLM-as-a-judge) before scaling to 100%.
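A minimal sketch of the shadow routing step, under assumptions not in the report: each agent version is modeled as a plain callable that maps an input string to a list of planned actions, and the names `ShadowRouter`, `Trace`, and `shadow_percent` are hypothetical. Requests are sampled by hashing the request ID so the same request always lands in the same bucket. The candidate's actions are only logged, never returned or executed.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical agent interface: user input -> list of planned actions.
Agent = Callable[[str], List[str]]


@dataclass
class Trace:
    request_id: str
    actions: List[str]
    error: Optional[str] = None


@dataclass
class ShadowRouter:
    production: Agent
    candidate: Agent
    shadow_percent: int  # 0..100, share of traffic mirrored to the candidate
    shadow_log: List[Trace] = field(default_factory=list)

    def in_shadow_sample(self, request_id: str) -> bool:
        # Deterministic bucketing: hash the request ID into 0..99.
        digest = hashlib.sha256(request_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < self.shadow_percent

    def handle(self, request_id: str, user_input: str) -> List[str]:
        # The user is always served by the production version.
        prod_actions = self.production(user_input)
        if self.in_shadow_sample(request_id):
            try:
                # Candidate output is logged for later evaluation and then discarded.
                shadow_actions = self.candidate(user_input)
                self.shadow_log.append(Trace(request_id, shadow_actions))
            except Exception as exc:
                # A crashing candidate must never affect the user-facing path.
                self.shadow_log.append(Trace(request_id, [], error=repr(exc)))
        return prod_actions
```

In a real deployment the candidate call would typically run asynchronously so it adds no latency to the production path, and any tools the candidate can reach would need to be stubbed so that "discarding the actions" really does prevent side effects.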
Journey Context:
Static eval datasets suffer from data leakage and Goodhart's law (the agent overfits to the eval set). Live traffic is messy and unpredictable. Shadow deployments bridge this gap: you get the safety of not affecting users, but the fidelity of real-world inputs. You compare the shadow traces against the production traces using an LLM evaluator to catch regressions on edge cases not covered in your static suite.
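The comparison step could be sketched roughly as follows. This is an illustration under stated assumptions, not a prescribed implementation: the judge is a pluggable callable standing in for an LLM call, the verdict labels (`better`, `same`, `worse`), the names `PairedTrace`, `evaluate_shadow`, and `safe_to_promote`, and the 2% regression threshold are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical judge interface: (user_input, production_trace, shadow_trace)
# -> "better" | "same" | "worse", describing the shadow trace relative to production.
Judge = Callable[[str, str, str], str]

VERDICTS = ("better", "same", "worse")


@dataclass
class PairedTrace:
    request_id: str
    user_input: str
    production_trace: str
    shadow_trace: str


def evaluate_shadow(
    pairs: List[PairedTrace], judge: Judge
) -> Tuple[Dict[str, int], List[str]]:
    """Tally judge verdicts and collect request IDs where the shadow regressed."""
    counts = {verdict: 0 for verdict in VERDICTS}
    regressions: List[str] = []
    for pair in pairs:
        verdict = judge(pair.user_input, pair.production_trace, pair.shadow_trace)
        if verdict not in counts:
            # LLM output is free-form, so reject anything outside the expected labels.
            raise ValueError(f"unexpected verdict {verdict!r} for {pair.request_id}")
        counts[verdict] += 1
        if verdict == "worse":
            regressions.append(pair.request_id)
    return counts, regressions


def safe_to_promote(counts: Dict[str, int], max_regression_rate: float = 0.02) -> bool:
    """Gate the rollout: block when too many shadow traces were judged worse.

    The 2% default is an assumed threshold, not a recommendation from the report.
    """
    total = sum(counts.values())
    if total == 0:
        return False  # no evidence yet, so do not promote
    return counts["worse"] / total <= max_regression_rate
```

The list of regressed request IDs is the more useful output in practice: those are the edge cases missing from the static suite, and reviewing them by hand also gives a check on whether the judge itself is reliable.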
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-14T23:32:32.137663+00:00— report_created — created