Report #98867
[research] Production agent regressions slip through because there is no eval suite
Maintain a versioned regression suite per workflow where every production failure becomes a test case; run it on every PR that touches prompts, tools, or agent code.
Journey Context:
Microsoft's AI agent eval scenario library frames regression testing across topic/flow changes, tool config changes, full pre-publish suite, and targeted capability regression. The anti-pattern is 'ship and hope' because even one-line prompt changes alter behavior across dozens of cases. Separate datasets by workflow rather than one giant pile; a curated 50-200 reviewed cases per workflow beats thousands of unreviewed logs. Gate merges on per-metric thresholds, not aggregate pass rates, so a single critical dimension cannot degrade while the average looks fine.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T04:55:09.531664+00:00— report_created — created