Report #11901
[research] Agent behavior regresses after model updates or prompt changes — caught too late in production
Implement an eval gate: before any deployment (model version change, prompt update, tool modification), run the regression eval suite and block the deploy if any eval's pass rate drops below its established baseline. For the gate itself, start with a small smoke-test subset (5-10 critical scenarios) that runs in minutes, then run the full suite asynchronously after the deploy.
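The baseline comparison at the heart of the gate can be sketched as follows. This is a minimal illustration, not a reference implementation: `EvalResult`, `find_regressions`, the `baselines` mapping, and the `tolerance` parameter are hypothetical names, and skipping evals that have no baseline yet is an assumption you may want to reverse.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EvalResult:
    """Outcome of one eval scenario (hypothetical shape)."""
    name: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # An eval with no trials counts as a 0.0 pass rate.
        return self.passed / self.total if self.total else 0.0


def find_regressions(
    results: List[EvalResult],
    baselines: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return names of evals whose pass rate fell below baseline.

    An empty list means the gate passes and the deploy may proceed.
    """
    regressed = []
    for result in results:
        baseline = baselines.get(result.name)
        if baseline is None:
            # Assumption: an eval without a baseline does not block.
            continue
        if result.pass_rate < baseline - tolerance:
            regressed.append(result.name)
    return regressed


if __name__ == "__main__":
    smoke_results = [
        EvalResult("refund_flow", passed=9, total=10),
        EvalResult("tool_selection", passed=7, total=10),
    ]
    smoke_baselines = {"refund_flow": 0.9, "tool_selection": 0.8}
    failed = find_regressions(smoke_results, smoke_baselines)
    if failed:
        raise SystemExit(f"Deploy blocked, regressions in: {failed}")
```

Exiting with a non-zero status is what lets a CI pipeline treat the script as a blocking step. A small `tolerance` can absorb run-to-run noise in pass rates, at the cost of letting small regressions through.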
Journey Context:
Teams treat agent deployments like traditional software deploys, but agent behavior is sensitive to changes that wouldn't break deterministic code: a model weight update, a rephrased system prompt, or a new tool description can silently change behavior. The eval-before-scaling pattern means you never push to production without running evals first. The key tradeoff is eval suite runtime vs. deployment velocity: a full regression suite might take 30 minutes, and running it inline blocks every deploy for that long. The solution is a two-tier approach: a fast smoke-test gate (must pass before deploy) and a comprehensive regression suite (runs async, can trigger rollback). OpenAI's open-source Evals framework is one tool that can be used to build this kind of recurring regression suite.
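The two-tier flow described above might be wired together roughly like this. It is a sketch under stated assumptions: `deploy_with_gates`, `regressed_evals`, and the injected `run_smoke`, `run_full`, `deploy`, and `rollback` callables are all hypothetical placeholders for whatever your pipeline provides, and pass rates are assumed to arrive as a name-to-rate mapping.

```python
from concurrent.futures import Executor, Future, ThreadPoolExecutor
from typing import Callable, Dict, List, Optional

PassRates = Dict[str, float]  # eval name -> pass rate in [0.0, 1.0]


def regressed_evals(rates: PassRates, baselines: PassRates) -> List[str]:
    """Names of evals below baseline (evals without a baseline never fail)."""
    return sorted(
        name for name, rate in rates.items()
        if rate < baselines.get(name, 0.0)
    )


def deploy_with_gates(
    run_smoke: Callable[[], PassRates],
    run_full: Callable[[], PassRates],
    deploy: Callable[[], None],
    rollback: Callable[[], None],
    baselines: PassRates,
    executor: Executor,
) -> Optional[Future]:
    """Tier 1 blocks the deploy; tier 2 runs afterwards and may roll back.

    Returns None if the smoke gate blocked the deploy, otherwise a Future
    that resolves to the list of regressions found by the full suite.
    """
    # Tier 1: fast, synchronous smoke gate.
    if regressed_evals(run_smoke(), baselines):
        return None

    deploy()

    # Tier 2: full suite in the background; roll back on any regression.
    def post_deploy_check() -> List[str]:
        failed = regressed_evals(run_full(), baselines)
        if failed:
            rollback()
        return failed

    return executor.submit(post_deploy_check)


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = deploy_with_gates(
            run_smoke=lambda: {"refund_flow": 1.0},
            run_full=lambda: {"refund_flow": 1.0, "long_tail": 0.95},
            deploy=lambda: print("deployed"),
            rollback=lambda: print("rolled back"),
            baselines={"refund_flow": 0.9, "long_tail": 0.9},
            executor=pool,
        )
        if pending is not None:
            print("full-suite regressions:", pending.result())
```

Passing the callables in keeps the gate logic independent of any particular eval runner or deploy tool. In a real pipeline the second tier would more likely be a separate job than a thread, but the control flow is the same.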
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:39:15.644358+00:00— report_created — created