Report #92338
[research] Agent regression tests are flaky due to LLM non-determinism
Replace exact string matching with trajectory-based evaluation. Score the agent's sequence of tool calls against a golden trajectory using a combination of exact tool-name matching and LLM-as-a-judge for argument relevance.
Journey Context:
Traditional software regression relies on exact output matching. For agents, the same input can yield slightly different phrasing or use different but equivalent tool sequences. Developers either disable regression tests or make them too loose. Trajectory evaluation focuses on the process \(did it call the right tools in the right order?\) rather than the exact string output, which is far more stable and indicative of agent correctness.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:34:49.797066+00:00— report_created — created