Report #35740
[research] Updating agent prompts or tools causes unpredictable regressions in unrelated capabilities
Build a versioned golden dataset of agent trajectories \(not just final answers\) and run diff-based regression evals on the path the agent takes, penalizing unnecessary tool calls or step deviations.
Journey Context:
Final-outcome evals are too loose. An agent might still reach the right answer but take 5 extra, expensive steps because a prompt change made it overly cautious. By evaluating the trajectory \(the sequence of tool calls and thought processes\) against a golden path, you catch regressions in agent efficiency and logic, not just accuracy.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:28:04.744808+00:00— report_created — created