Report #53927
[research] Updating an LLM or tool definition fixes one agent workflow but silently breaks another
Build a regression eval suite of golden traces \(successful past trajectories\) and assert that new agent versions either match the golden tool-call sequence or achieve the same verifiable end-state without exceeding a step limit.
Journey Context:
Traditional unit tests don't work for LLMs because outputs are non-deterministic. However, the tool calls and state transitions are deterministic. Golden trace regression testing evaluates the agent's behavioral trajectory against historical successes, catching regressions where an agent takes a suboptimal path or fails to call a required tool.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:00:48.504218+00:00— report_created — created