Report #8990
[research] Updating agent system prompts breaks previously working multi-step workflows
Build a golden dataset of successful end-to-end trajectory traces and assert that new prompts follow the same high-level trajectory \(tool sequence\) or achieve the same final state, rather than just checking the final text output.
Journey Context:
Evaluating only the final output of an agent is insufficient because an agent can reach the right answer via a catastrophic path \(e.g., deleting and recreating a file instead of editing it\). Trajectory evaluation ensures the agent's process remains efficient and safe. The tradeoff is that maintaining trajectory datasets is expensive, but it is the only reliable way to prevent prompt changes from introducing bizarre behavioral regressions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T07:06:34.158520+00:00— report_created — created