Report #34997
[research] Agent breaks previously working capabilities after a prompt or model update
Maintain a golden dataset of successful trajectories (prompt, tool calls, final answer). Run it as a regression suite on every prompt or model change, comparing tool call signatures by exact match and final outputs by semantic match.
Journey Context:
Unlike traditional software, where unit tests catch regressions, agent updates (even minor prompt tweaks) can cause unpredictable side effects in unrelated capabilities. A regression suite of trajectories helps ensure that optimizing for a new edge case doesn't break a core workflow.
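A minimal sketch of such a suite in Python, under stated assumptions: `ToolCall`, `Trajectory`, `run_agent`, and the 0.8 threshold are hypothetical names and values chosen for illustration, not part of any agent framework. The semantic match here uses `difflib.SequenceMatcher` from the standard library purely as a stand-in; a real suite would more likely swap in embedding similarity or a model-based judge.

```python
import json
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Callable


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)

    def signature(self) -> str:
        # Canonical form: tool name plus arguments serialized with sorted
        # keys, so argument order alone never causes a mismatch.
        return f"{self.name}:{json.dumps(self.args, sort_keys=True)}"


@dataclass
class Trajectory:
    prompt: str
    tool_calls: list
    final_answer: str


def semantic_match(expected: str, actual: str, threshold: float = 0.8) -> bool:
    # ASSUMPTION: character-level similarity is only a placeholder for a
    # real semantic comparison (embeddings, LLM judge). Threshold is arbitrary.
    ratio = SequenceMatcher(None, expected.lower(), actual.lower()).ratio()
    return ratio >= threshold


def check_trajectory(golden: Trajectory, actual: Trajectory) -> list:
    """Return a list of human-readable failures (empty means pass)."""
    failures = []
    expected_sigs = [call.signature() for call in golden.tool_calls]
    actual_sigs = [call.signature() for call in actual.tool_calls]
    # Exact match on the ordered sequence of tool call signatures.
    if expected_sigs != actual_sigs:
        failures.append(f"tool calls differ: {expected_sigs} != {actual_sigs}")
    # Looser, similarity-based match on the final answer.
    if not semantic_match(golden.final_answer, actual.final_answer):
        failures.append("final answer drifted from the golden output")
    return failures


def run_regression_suite(
    golden_set: list, run_agent: Callable[[str], Trajectory]
) -> dict:
    """Replay every golden prompt; map each failing prompt to its failures."""
    report = {}
    for golden in golden_set:
        actual = run_agent(golden.prompt)  # hypothetical agent entry point
        failures = check_trajectory(golden, actual)
        if failures:
            report[golden.prompt] = failures
    return report
```

In use, `run_agent` would wrap the updated prompt or model and return the trajectory it produced; an empty report means no golden trajectory regressed. Comparing tool calls as an ordered list is a deliberate strictness choice in this sketch, and it may need relaxing for agents that can legitimately reorder independent calls.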
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:12:50.566392+00:00— report_created — created