Report #9376
[research] Agent regression suites are flaky because LLM outputs are non-deterministic
Decouple LLM reasoning from tool execution in regression suites. Mock the tool calls and evaluate the intent and parameters of the tool call using semantic similarity or JSON schema validation, rather than exact string matching on the LLM's output.
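One way to check a tool call's intent and parameters instead of its exact text is sketched below, using only the standard library. Everything named here is an assumption for illustration: the `{"tool": ..., "params": ...}` call format, the `get_weather` tool, and the simplified type-map schema (a stand-in for a real JSON Schema validator). `difflib.SequenceMatcher` gives lexical similarity only, not semantic similarity; a real suite would likely swap in an embedding-based score.

```python
import json
from difflib import SequenceMatcher

# Hypothetical, simplified schema: parameter name -> required Python type.
# A real suite would more likely use a full JSON Schema validator.
WEATHER_SCHEMA = {"city": str, "unit": str}

def validate_params(params, schema):
    """Return a list of problems; an empty list means the params fit the schema."""
    problems = []
    for name, expected_type in schema.items():
        if name not in params:
            problems.append(f"missing parameter: {name}")
        elif not isinstance(params[name], expected_type):
            problems.append(f"wrong type for {name}")
    for name in params:
        if name not in schema:
            problems.append(f"unexpected parameter: {name}")
    return problems

def text_similarity(a, b):
    """Lexical similarity in [0, 1]; a stand-in for an embedding-based score."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def check_tool_call(raw_call, expected_tool, schema, expected_text=None, threshold=0.8):
    """Pass if the call names the right tool, fits the schema, and any
    free-text parameters are close enough to their expected values."""
    call = json.loads(raw_call)
    if call.get("tool") != expected_tool:
        return False
    params = call.get("params", {})
    if validate_params(params, schema):
        return False
    for name, expected in (expected_text or {}).items():
        if text_similarity(params[name], expected) < threshold:
            return False
    return True
```

A call such as `{"tool": "get_weather", "params": {"city": "san francisco", "unit": "celsius"}}` passes against an expected city of "San Francisco" even though the casing differs, which is the kind of variation exact string matching would flag as a failure.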
Journey Context:
Traditional software regression testing relies on exact outputs. Agent regression must test the decision boundary (did it choose the right tool with the right params?) rather than the exact phrasing of the thought process. Mocking tools allows deterministic testing of the environment, while semantic evals handle the LLM's non-determinism.
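A minimal sketch of a decision-boundary test with mocked tools follows. The `fake_agent` function, the `lookup_order` and `cancel_order` tool names, and the order ID are all hypothetical; `fake_agent` stands in for a real LLM-backed agent whose thought text varies between runs. Only `unittest.mock.Mock` and its assertion methods come from the standard library.

```python
from unittest.mock import Mock

def fake_agent(prompt, tools, phrasing):
    """Hypothetical stand-in for an LLM agent: the thought text varies
    between runs, but the tool decision stays the same."""
    thought = f"{phrasing} I should look up the order."
    result = tools["lookup_order"](order_id="A-123")
    return {"thought": thought, "answer": f"Status: {result['status']}"}

def run_case(phrasing):
    # Mocked tools make the environment deterministic and record every call.
    lookup_order = Mock(return_value={"status": "shipped"})
    tools = {"lookup_order": lookup_order, "cancel_order": Mock()}
    output = fake_agent("Where is order A-123?", tools, phrasing)
    return tools, output

def test_decision_boundary():
    # Different phrasings simulate non-deterministic LLM output.
    for phrasing in ("Hmm.", "Let me think.", "The user asks about shipping."):
        tools, output = run_case(phrasing)
        # Assert on the decision: right tool, right params, no stray calls.
        tools["lookup_order"].assert_called_once_with(order_id="A-123")
        tools["cancel_order"].assert_not_called()
        # The thought text is deliberately left unasserted.
        assert "shipped" in output["answer"]
```

The test never inspects `output["thought"]`, so rewording in the reasoning cannot cause a failure; only a change in which tool is called, or with what arguments, does.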
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T08:06:22.413208+00:00— report_created — created