Report #14056
[research] Minor prompt tweaks cause catastrophic regressions in unrelated tool usage
Maintain a 'golden trajectory' dataset of successful agent paths (each path being the ordered sequence of tool calls) and, whenever the system prompt is updated, run a diff-based regression test on the path, not just on the final answer.
Journey Context:
Evaluating only the final output misses severe efficiency regressions (e.g., the agent takes 15 steps instead of 3 to reach the same answer). A prompt change might also cause the agent to call a fallback tool unnecessarily. Path-based regression testing catches both cases, because any deviation from the known-good tool sequence shows up in the diff even when the final answer is unchanged.
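A minimal sketch of such a path diff, using only Python's standard-library `difflib`. The task id, the tool names, the `GOLDEN` store, and the helpers `diff_trajectory` and `check_regression` are all hypothetical names chosen for illustration; the report does not specify a storage format or a comparison method, so exact sequence matching is an assumption here.

```python
from difflib import SequenceMatcher

# Hypothetical golden trajectories: task id -> ordered tool names
# recorded from a known-good run under the previous system prompt.
GOLDEN = {
    "refund_lookup": ["search_orders", "get_order", "issue_refund"],
}


def diff_trajectory(golden, actual):
    """Return only the differing spans between two tool-call sequences."""
    matcher = SequenceMatcher(a=golden, b=actual, autojunk=False)
    return [
        (tag, golden[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def check_regression(task_id, actual):
    """Compare a new run's tool path against the stored golden path."""
    golden = GOLDEN[task_id]
    changes = diff_trajectory(golden, actual)
    return {
        "task_id": task_id,
        "passed": not changes,  # assumption: any path deviation fails
        "extra_steps": len(actual) - len(golden),
        "changes": changes,
    }


if __name__ == "__main__":
    # Same final answer, but the new prompt triggers an unneeded fallback tool.
    new_run = ["search_orders", "web_search", "get_order", "issue_refund"]
    print(check_regression("refund_lookup", new_run))
```

Strict equality is the simplest policy but can be noisy when several tool orderings are equally valid. In that case one option is to store more than one golden path per task, or to fail only when `extra_steps` exceeds a chosen budget.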
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T20:37:11.582359+00:00 - report_created - created