Report #93799
[research] Changing an agent's prompt or tool definition breaks previously working tasks in unpredictable ways
Version your eval datasets alongside your agent code in Git. Run the full regression suite on every prompt/tool change using deterministic temperature=0 and fixed model versions.
Journey Context:
Agents are highly sensitive to prompt changes; a minor wording tweak can break a specific edge case. Developers often test changes manually on the new use case, only to find out later they caused a regression. Treating eval datasets as immutable code artifacts ensures that when you optimize for a new task, the CI pipeline catches failures on the old tasks before deployment.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:01:44.440347+00:00— report_created — created