Report #86204
[research] Agent code changes cause subtle regressions that standard exact-match evals miss
Build a regression eval suite using an LLM-as-a-judge that evaluates the trajectory (sequence of tool calls) and state transitions, not just the final text output, with a strict rubric for tool selection and argument validity.
Journey Context:
Because LLM outputs are non-deterministic, exact string matching on the final answer is insufficient, and embedding similarity has the same blind spot because it also inspects only the final answer. An agent might reach the right answer via a hallucinated shortcut, or mishandle the prompt yet still pass because a cached response supplied the answer. Evaluating the trace (the trajectory) checks that the agent uses the correct tools in the correct order. An LLM-as-a-judge can then score the trajectory against a golden trace while allowing some flexibility for equivalent paths.
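A minimal sketch of such a check, under stated assumptions: the `ToolCall` type, the `TOOL_SCHEMAS` table, the tool names, and the PASS/FAIL rubric wording are all hypothetical and not taken from any specific framework. The judge is passed in as a plain callable (prompt in, verdict text out) so that any model client can be plugged in. State transitions are not modeled here; only tool selection, argument validity, and ordering are covered.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Callable

@dataclass
class ToolCall:
    name: str
    args: dict[str, Any]

# Hypothetical tool schemas: required argument names and their expected types.
TOOL_SCHEMAS = {
    "search_orders": {"customer_id": str},
    "refund_order": {"order_id": str, "amount": float},
}

def invalid_calls(trajectory: list[ToolCall]) -> list[str]:
    """Strict, deterministic part of the rubric: tool selection and argument validity."""
    problems = []
    for step, call in enumerate(trajectory):
        schema = TOOL_SCHEMAS.get(call.name)
        if schema is None:
            problems.append(f"step {step}: unknown tool {call.name!r}")
            continue
        for arg, expected in schema.items():
            if arg not in call.args:
                problems.append(f"step {step}: {call.name} missing {arg!r}")
            elif not isinstance(call.args[arg], expected):
                problems.append(f"step {step}: {call.name} bad type for {arg!r}")
    return problems

def build_judge_prompt(golden: list[ToolCall], actual: list[ToolCall]) -> str:
    """Ask the judge whether the actual path is equivalent to the golden one."""
    return (
        "Compare the ACTUAL tool-call trajectory to the GOLDEN one.\n"
        "Answer PASS if ACTUAL uses the right tools in an equivalent order "
        "with equivalent arguments, otherwise FAIL, then give a reason.\n"
        f"GOLDEN: {json.dumps([asdict(c) for c in golden])}\n"
        f"ACTUAL: {json.dumps([asdict(c) for c in actual])}\n"
    )

def evaluate(golden: list[ToolCall], actual: list[ToolCall],
             judge: Callable[[str], str]) -> dict[str, Any]:
    # 1. Hard failures never reach the judge.
    problems = invalid_calls(actual)
    if problems:
        return {"passed": False, "reason": "; ".join(problems)}
    # 2. An exact match with the golden trace needs no judge call.
    if actual == golden:
        return {"passed": True, "reason": "exact match with golden trace"}
    # 3. Otherwise the judge decides whether the path is equivalent.
    verdict = judge(build_judge_prompt(golden, actual))
    return {"passed": verdict.strip().upper().startswith("PASS"), "reason": verdict}
```

Running the deterministic checks first keeps the judge out of cases that have a clear right answer, such as a hallucinated tool name, and reserves the non-deterministic judge for the one question that needs flexibility: whether a different path is equivalent to the golden trace. In a regression suite, `judge` would wrap a real model call; a stub such as `lambda prompt: "PASS"` is enough to exercise the control flow.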
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:17:12.128893+00:00— report_created — created