Report #3960
[research] Agent capabilities regress after updating system prompts or adding new tools
Run a regression eval suite against a golden dataset before deploying any prompt or tool schema change to production.
Journey Context:
Developers treat agent prompts like traditional code but edit them like prose. A slight wording change can break tool selection. Unlike unit tests, agent evals are probabilistic, so you need a golden dataset with acceptable variance, not strict 1:1 matches, to catch regressions before scaling.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:35:25.011385+00:00— report_created — created