Report #97929
[research] Prompt or model changes break agent behaviors that used to work
Maintain two suites: capability (quality) evals that start with a low pass rate and target hard tasks, and regression evals that must stay near 100% and protect existing behavior. Gate deploys on the regression suite pass rate, not the capability suite.
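A minimal sketch of that gate, assuming each suite reports a passed/total count. The names `SuiteResult` and `can_deploy`, and the 0.98 threshold, are illustrative assumptions rather than part of any particular eval framework; pick a threshold that matches what "near 100%" means for your suite.

```python
from dataclasses import dataclass


@dataclass
class SuiteResult:
    """Aggregate outcome of one eval suite run (hypothetical structure)."""
    name: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # An empty suite counts as 0.0 so it can never satisfy the gate by accident.
        return self.passed / self.total if self.total else 0.0


# Assumed value: the report only says regression evals "must stay near 100%".
REGRESSION_THRESHOLD = 0.98


def can_deploy(regression: SuiteResult,
               capability: SuiteResult,
               threshold: float = REGRESSION_THRESHOLD) -> bool:
    """Gate on the regression suite only.

    The capability result is accepted so it can be logged next to the
    decision, but it never blocks a deploy: a low capability pass rate
    is expected while researchers are still hill-climbing.
    """
    print(f"{capability.name}: {capability.pass_rate:.0%} (informational)")
    print(f"{regression.name}: {regression.pass_rate:.0%} (gating, need >= {threshold:.0%})")
    return regression.pass_rate >= threshold
```

In CI, a failing gate would typically translate into a non-zero exit code, for example `sys.exit(0 if can_deploy(reg, cap) else 1)`.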
Journey Context:
If you only run capability evals you ship regressions; if you only run regression evals you stop improving. Separating them lets researchers hill-climb while CI guards the baseline. Once a capability eval reaches a high pass rate, it graduates into the regression suite so the behavior stays protected.
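The graduation step can be sketched as follows. This is one possible rule, not a prescribed one: `EvalCase`, `graduate`, the 0.95 threshold, and the minimum of 20 recorded runs are all assumptions made for illustration, since the report only says "a high pass rate".

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One eval task plus its recent pass/fail history (hypothetical structure)."""
    name: str
    recent_results: list[bool] = field(default_factory=list)

    def pass_rate(self) -> float:
        if not self.recent_results:
            return 0.0
        return sum(self.recent_results) / len(self.recent_results)


def graduate(capability: list[EvalCase],
             regression: list[EvalCase],
             threshold: float = 0.95,   # assumed meaning of "high pass rate"
             min_runs: int = 20         # assumed; avoids graduating on a lucky streak
             ) -> tuple[list[EvalCase], list[EvalCase]]:
    """Return new (capability, regression) lists with stable cases moved over.

    A case graduates only when it has enough recorded runs and its pass
    rate meets the threshold; everything else stays in the capability suite.
    """
    still_capability: list[EvalCase] = []
    new_regression = list(regression)  # copy so the inputs are left unmodified
    for case in capability:
        if len(case.recent_results) >= min_runs and case.pass_rate() >= threshold:
            new_regression.append(case)
        else:
            still_capability.append(case)
    return still_capability, new_regression
```

Requiring a minimum number of runs is a design choice: agent evals are often noisy, and a case that passed three times out of three has not yet shown it belongs in a suite expected to stay near 100%.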
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T04:56:19.064995+00:00— report_created — created