Report #85283
[research] Deploying prompt or tool changes to production agents causes widespread task failure
Run a regression eval suite against a golden dataset of agent trajectories on every PR, blocking merges if the task completion rate drops below the baseline or if cost per step increases beyond a threshold.
Journey Context:
Agents are highly sensitive to prompt changes: a minor wording tweak can trigger an infinite loop or tool misuse. Eval-before-scaling means treating agent prompts and tool definitions like traditional software, so no PR merges without passing CI. Maintain a dataset of representative past interactions and assert that the new version resolves them at least as accurately (task completion rate) and as efficiently (cost per step) as the current baseline.
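A minimal sketch of such a merge gate, under stated assumptions: `GoldenCase`, `RunResult`, `EvalSummary`, `evaluate`, and `gate` are hypothetical names, not part of any real framework. The sketch also assumes success can be judged by exact match against an expected output, that a run exceeding a step cap counts as a failure (to catch loops), and that a 10% cost-per-step increase is the tolerated threshold. Real suites would likely use a richer grader and tuned thresholds.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class GoldenCase:
    """One recorded task from the golden dataset (hypothetical schema)."""
    task_id: str
    prompt: str
    expected: str


@dataclass(frozen=True)
class RunResult:
    """What a single agent run reports back (hypothetical schema)."""
    output: str
    steps: int
    cost_usd: float


@dataclass(frozen=True)
class EvalSummary:
    completion_rate: float  # fraction of golden cases solved
    cost_per_step: float    # total cost divided by total steps


def evaluate(
    cases: Iterable[GoldenCase],
    run_agent: Callable[[str], RunResult],
    max_steps: int = 20,  # assumed cap; runs above it count as failures
) -> EvalSummary:
    """Replay every golden case through the agent and aggregate the metrics."""
    cases = list(cases)
    if not cases:
        raise ValueError("golden dataset is empty")
    completed = 0
    total_steps = 0
    total_cost = 0.0
    for case in cases:
        result = run_agent(case.prompt)
        # Exact match is a simplifying assumption; swap in a real grader.
        if result.steps <= max_steps and result.output == case.expected:
            completed += 1
        total_steps += result.steps
        total_cost += result.cost_usd
    cost_per_step = total_cost / total_steps if total_steps else 0.0
    return EvalSummary(completed / len(cases), cost_per_step)


def gate(
    candidate: EvalSummary,
    baseline: EvalSummary,
    max_cost_increase: float = 0.10,  # assumed threshold: 10% over baseline
) -> list[str]:
    """Return the reasons to block the merge; an empty list means pass."""
    failures = []
    if candidate.completion_rate < baseline.completion_rate:
        failures.append(
            f"completion rate {candidate.completion_rate:.2%} is below "
            f"baseline {baseline.completion_rate:.2%}"
        )
    if candidate.cost_per_step > baseline.cost_per_step * (1 + max_cost_increase):
        failures.append(
            f"cost per step {candidate.cost_per_step:.4f} exceeds baseline "
            f"{baseline.cost_per_step:.4f} by more than {max_cost_increase:.0%}"
        )
    return failures
```

In CI, a wrapper script could call `evaluate` on the PR's agent build, compare the result to a stored baseline summary with `gate`, print any returned reasons, and exit non-zero so the merge is blocked.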
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:44:12.506809+00:00— report_created — created