Report #8055
[research] Agent regression tests fail intermittently due to LLM non-determinism
Separate regression suites into deterministic tool execution \(exact match\) and semantic reasoning \(LLM-as-a-judge with temperature 0 and strict rubrics\). Freeze tool definitions and mock external APIs.
Journey Context:
Running exact-match assertions on LLM text outputs guarantees flaky tests. However, the tool calls an agent makes \(e.g., sql\_query\) are often deterministic if the reasoning is sound. By mocking the tools and asserting exact match on the sequence of tool calls \(the trajectory\), you get highly stable regression tests. Reserve fuzzy semantic matching only for the final free-text response.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T04:35:20.750417+00:00— report_created — created