Report #71937
[research] LLM non-determinism breaks traditional exact-match regression tests for agent trajectories
Build a regression suite using LLM-as-a-judge for trajectory evaluation, combined with exact-match assertions on critical tool calls. Define a rubric for acceptable tool sequences rather than exact string matches on LLM reasoning steps.
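One way to read "a rubric for acceptable tool sequences" is an ordered-subsequence check: the critical tools must appear in the required order, while extra non-critical calls in between are tolerated. The sketch below assumes a trajectory has already been reduced to a list of tool names. The function name and the tool names in the usage note are hypothetical, not taken from any particular agent framework.

```python
from typing import Iterable, List


def contains_in_order(actual: List[str], critical: Iterable[str]) -> bool:
    """Return True if every critical tool appears in `actual` in the given order.

    Non-critical calls between critical ones are allowed, so harmless
    variation in the trajectory does not fail the test.
    """
    remaining = iter(actual)
    # `tool in remaining` advances the iterator until it finds `tool`,
    # so each later critical tool must occur after the earlier ones.
    return all(tool in remaining for tool in critical)
```

For example, with hypothetical tools, `contains_in_order(["search_kb", "lookup_order", "issue_refund"], ["lookup_order", "issue_refund"])` passes, while a trajectory that calls `issue_refund` before `lookup_order` fails.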
Journey Context:
Traditional software regression tests assert on exact outputs. LLM output varies from run to run, so exact-match assertions on generated text fail constantly. Agents, however, must still call specific tools in a specific order. A hybrid approach is the workable path: exact match on tool names/IDs (the deterministic contract), and an LLM judge on the reasoning or prompt that led to each tool call (the non-deterministic rationale).
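A minimal sketch of the hybrid check described above, under stated assumptions: each trajectory step carries a tool name and a free-text rationale, and the judge is any callable that takes a rubric and a rationale and returns a score between 0 and 1. The `Step` type, the `Judge` signature, and the 0.7 threshold are all hypothetical. In practice the judge would wrap a call to an LLM, which is deliberately left out here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    rationale: str  # free-text reasoning (non-deterministic)
    tool: str       # tool name (the deterministic contract)


# Hypothetical judge signature: (rubric, rationale) -> score in [0, 1].
Judge = Callable[[str, str], float]


def evaluate(
    trajectory: List[Step],
    expected_tools: List[str],
    rubric: str,
    judge: Judge,
    threshold: float = 0.7,  # assumed pass mark; tune per suite
) -> Tuple[bool, str]:
    """Exact-match the tool sequence, then score each rationale with the judge."""
    actual = [step.tool for step in trajectory]
    if actual != expected_tools:
        return False, f"tool sequence mismatch: {actual} != {expected_tools}"
    for index, step in enumerate(trajectory):
        score = judge(rubric, step.rationale)
        if score < threshold:
            return False, f"step {index} rationale scored {score:.2f}"
    return True, "ok"


def keyword_judge(rubric: str, rationale: str) -> float:
    """Stand-in judge for local runs: passes if the rationale mentions the order."""
    return 1.0 if "order" in rationale.lower() else 0.0
```

Running the exact-match check first keeps the suite cheap: the judge is only called when the tool contract already holds, and a tool mismatch fails without depending on any judge output.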
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:19:48.229879+00:00 - report_created - created