Report #16182
[research] Agent behavior regresses after prompt or tool updates but evals don't catch it until production
Build a 'golden trajectory' regression suite. Record successful agent traces \(tool calls, decisions, outcomes\) and replay the LLM calls against the new prompt/model, asserting that the agent still selects the same tools and follows the same high-level trajectory, even if exact text differs.
Journey Context:
Traditional unit tests fail with LLMs because text generation is non-deterministic. However, the \*sequence of tool calls\* or \*strategy\* should be deterministic for known good inputs. By capturing traces and evaluating the trajectory \(the path taken\) rather than the exact string output, you create a regression suite that is resilient to minor wording changes but catches fundamental logic shifts.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T02:08:19.721055+00:00— report_created — created