Report #24401
[research] Updating the LLM underlying the agent fixes one edge case but breaks previously working agent trajectories
Maintain a golden trajectory regression suite that asserts the exact sequence of tool calls and state transitions for critical paths, not just the final text output.
Journey Context:
LLM output is non-deterministic, and a model update often changes the agent's preferred tool-use syntax or reasoning path. If you evaluate only the final output, you cannot tell whether the agent now makes a destructive tool call and then recovers, or takes a far less efficient path to the same answer. Golden trajectory evals catch those regressions, so the agent's behavior stays constrained and safe across model upgrades.
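A minimal sketch of such a check, assuming the agent harness can record each step as a tool name, its arguments, and the resulting state. The names here (`Step`, `step`, `diff_trajectories`, and the `lookup_order` / `issue_refund` / `delete_order` tools) are hypothetical and not taken from any particular framework:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One recorded agent step (hypothetical schema)."""
    tool: str     # name of the tool the agent called
    args: tuple   # (key, value) pairs sorted by key, so argument order is irrelevant
    state: str    # state label the harness recorded after the call


def step(tool, state, **args):
    """Convenience constructor that normalizes keyword arguments."""
    return Step(tool, tuple(sorted(args.items())), state)


def diff_trajectories(golden, actual):
    """Return human-readable mismatches; an empty list means an exact match."""
    problems = []
    for i, (g, a) in enumerate(zip(golden, actual)):
        if g != a:
            problems.append(
                f"step {i}: expected {g.tool}{dict(g.args)} -> {g.state}, "
                f"got {a.tool}{dict(a.args)} -> {a.state}"
            )
    if len(golden) != len(actual):
        problems.append(
            f"length: expected {len(golden)} steps, got {len(actual)}"
        )
    return problems


# Golden trajectory for a critical path (illustrative data only).
GOLDEN = [
    step("lookup_order", "order_loaded", order_id="A1"),
    step("issue_refund", "refunded", order_id="A1"),
]

# A trajectory from an upgraded model that ends in the same final state
# but slips in a destructive call along the way.
candidate = [
    step("lookup_order", "order_loaded", order_id="A1"),
    step("delete_order", "order_deleted", order_id="A1"),
    step("issue_refund", "refunded", order_id="A1"),
]

mismatches = diff_trajectories(GOLDEN, candidate)
```

Here the final state of `candidate` matches the golden one, so an output-only eval would pass, while `mismatches` flags both the unexpected `delete_order` call and the extra step. In a regression suite, the assertion would be that `diff_trajectories` returns an empty list for every critical path before a model upgrade is accepted.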
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:22:16.543674+00:00— report_created — created