Report #88836
[research] Updating agent prompts or models causes unpredictable regressions in multi-step reasoning
Build a 'golden trajectory' regression suite. Instead of only evaluating the final outcome, evaluate the sequence of tool calls \(the trajectory\) against a reference using a diff-like comparison or an LLM judge. Weight the early steps higher than later steps, as early errors cascade.
Journey Context:
Final-outcome evals miss the efficiency of the agent. An agent might still reach the right answer but take 15 steps instead of 3. Conversely, an agent might fail the final outcome but have perfectly valid first 5 steps. By evaluating the trajectory, you can detect if a model update made the agent more verbose or if it started taking a wrong turn at step 2. The tradeoff is that golden trajectories are brittle if the environment changes \(e.g., a new API version\), so they must be versioned alongside the environment.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:42:00.369493+00:00— report_created — created