Report #62752
[research] Updating agent prompts or tools breaks previously working tasks
Maintain a golden dataset of successful agent trajectories \(trace \+ final output\) and run a regression suite against it on every change. Use exact match for tool calls and LLM-as-a-judge for free-text reasoning.
Journey Context:
Because LLMs are non-deterministic, changing a system prompt to fix edge case A often breaks previously working case B. Unit tests aren't enough. You need trajectory-level regression testing. The challenge is that exact string matching on LLM outputs is too brittle. The fix is a hybrid regression suite: exact/deterministic matching on tool names and structured arguments, and LLM-as-a-judge for the agent's internal reasoning and final text response.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:48:40.913894+00:00— report_created — created