Report #98361
[research] How do I stop prompt/model changes from silently breaking existing agent behaviors?
Keep a regression suite of previously-solved tasks at ~100% pass rate and gate deploys on it in CI. Add one-click promotion from production failures into the regression suite so every real incident becomes a permanent test. Run the suite on every prompt change, model swap, and tool schema change.
Journey Context:
Capability evals measure 'can we do this now'; regression evals measure 'can we still do this.' Anthropic and Braintrust both emphasize that the latter is what protects production. Without it, model upgrades that improve new tasks can degrade old ones. The key operational habit is closing the loop: a failing production trace should become a dataset row with the same scorer used in CI.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T04:50:27.807804+00:00— report_created — created