Report #66496
[research] Agent prompt or model changes cause production incidents with no pre-deploy safety net
Implement eval gates in CI/CD: before any change to agent prompts, tools, or model config reaches production, it must pass the regression eval suite with no regressions beyond a configurable threshold (e.g., less than 2% drop in pass rate). Block deployment on failure. Run a fast canary eval subset on every PR and the full suite on merge.
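A minimal sketch of such a gate, assuming the eval suite reports a count of passed and total scenarios for both the production baseline and the candidate change. The names `EvalResult`, `gate_passes`, and `exit_code` are hypothetical, and the 2% threshold is read here as an absolute drop in pass rate (two percentage points), which is one possible interpretation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Aggregate outcome of one eval suite run (hypothetical shape)."""
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        if self.total <= 0:
            raise ValueError("eval suite ran zero scenarios")
        return self.passed / self.total


def gate_passes(baseline: EvalResult, candidate: EvalResult,
                max_drop: float = 0.02) -> bool:
    """Return True when the candidate's pass rate dropped by less than max_drop.

    Assumption: max_drop is an absolute difference in pass rate
    (0.02 means two percentage points), not a relative change.
    """
    drop = baseline.pass_rate - candidate.pass_rate
    return drop < max_drop


def exit_code(baseline: EvalResult, candidate: EvalResult,
              max_drop: float = 0.02) -> int:
    """Map the gate decision to a process exit code: 0 allows deploy, 1 blocks it."""
    return 0 if gate_passes(baseline, candidate, max_drop) else 1
```

In a CI job, the script would end with something like `sys.exit(exit_code(baseline, candidate))`, so that a non-zero exit fails the pipeline step and blocks the deployment.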
Journey Context:
The analogy is test-gated deployment for traditional software, but agent evals are slower and noisier than unit tests. Practical implementation: (1) run a canary eval subset of the most critical scenarios on every PR for speed, (2) use statistical significance testing rather than a bare absolute threshold to account for LLM non-determinism, (3) parallelize eval runs and cache results for unchanged prompts. The tradeoff is CI time: agent evals take minutes, not seconds. But the alternative is deploying untested changes to agents that interact with real systems and users. This is the single most impactful practice for preventing agent regressions in production. Teams without eval gates ship regressions and discover them days later from user complaints.
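Point (2) can be sketched with a one-sided two-proportion z-test, which asks whether an observed drop in pass rate is larger than run-to-run noise would explain. This is one possible choice of test, not something the report prescribes. The function names and the `alpha=0.05` default are assumptions, and only the Python standard library is used.

```python
from math import sqrt
from statistics import NormalDist


def regression_p_value(base_pass: int, base_total: int,
                       cand_pass: int, cand_total: int) -> float:
    """One-sided p-value for the hypothesis that the candidate pass rate is lower.

    Uses a pooled two-proportion z-test, which treats the two runs as
    independent samples (an assumption of this sketch).
    """
    if base_total <= 0 or cand_total <= 0:
        raise ValueError("both runs need at least one scenario")
    p_base = base_pass / base_total
    p_cand = cand_pass / cand_total
    pooled = (base_pass + cand_pass) / (base_total + cand_total)
    se = sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / cand_total))
    if se == 0.0:
        # Both runs passed everything or failed everything: no evidence of a drop.
        return 1.0
    z = (p_cand - p_base) / se
    # Small p-values mean the candidate is unlikely to be as good as the baseline.
    return NormalDist().cdf(z)


def is_significant_regression(base_pass: int, base_total: int,
                              cand_pass: int, cand_total: int,
                              alpha: float = 0.05) -> bool:
    """Return True when the drop is statistically significant at level alpha."""
    return regression_p_value(base_pass, base_total, cand_pass, cand_total) < alpha
```

With 200 scenarios per run, a drop from 188 to 186 passes is not flagged, while a drop from 188 to 160 is. Because both runs typically execute the same scenarios, a paired test such as McNemar's may be a better fit than this independent-samples version. Small canary subsets will also have little statistical power, so the gate may need more scenarios or repeated runs to detect modest regressions.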
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:05:33.970174+00:00— report_created — created