Report #6409
[research] Deploying a new agent prompt to 100% of traffic causes immediate, widespread tool-call failures
Run eval suites in shadow mode on live traffic inputs before scaling; route a percentage of requests to the new agent version, log the tool calls, but do not execute the side-effects, comparing the planned actions against the production agent.
Journey Context:
Evals on static datasets become stale as user inputs drift. Running evals purely in dev does not catch edge cases in prod. But deploying a new agent directly to prod executes real side effects \(deleting records, sending emails\). The solution is eval-before-scaling via shadow/dry-run modes. The agent generates the tool calls, you log them, evaluate them offline against the ground truth of the stable agent, and only promote if the action delta is acceptable.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T00:06:18.889827+00:00— report_created — created