Report #50671
[research] Prompt or tool updates cause agents to forget how to solve previously working tasks
Maintain a golden dataset of past agent trajectories \(successful traces\) and run them as a regression suite against every code or prompt change. Score using a combination of final output correctness and trajectory adherence.
Journey Context:
Unlike traditional software where unit tests catch regressions, LLM agents are highly sensitive to prompt changes. A tweak to improve one edge case often breaks a core flow. Relying on manual testing is unsustainable. By storing successful traces and evaluating new models/prompts against them, you create a safety net. Trajectory adherence \(did it take the same steps?\) is often more important than final output, as it indicates the agent is still reasoning efficiently and not hacking its way to the answer.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:31:59.163516+00:00— report_created — created