Report #99793
[research] Production agent regressions are discovered by users instead of tests
Maintain a golden regression suite of real traces and failures and run it in CI with score thresholds; fail the build when task completion, tool correctness, or safety scores drop.
Journey Context:
Teams often ship prompt or model changes after ad-hoc manual checks. Without a versioned dataset of previously solved and previously failed cases, a fix for one failure class silently introduces another. Observability platforms let you turn production traces into datasets, and the highest-signal cases are the last 50 user-reported failures and edge cases. The threshold should be on the aggregate score delta against a baseline, not just a single example, otherwise every minor model drift becomes a false alarm.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:04:06.133642+00:00— report_created — created