Report #49714
[research] Flaky regression suites caused by LLM non-determinism in agent tool calls
Decouple tool execution from agent decision-making in evals by mocking tool responses and evaluating the sequence of tool calls rather than the final string output.
Journey Context:
Because LLMs are non-deterministic, the exact phrasing of a final answer will vary across runs, causing string-match regression tests to fail intermittently. By mocking the environment and asserting that the agent calls the correct sequence of tools \(e.g., read\_file -> edit\_file\), you get a highly stable regression signal that survives prompt tweaks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:55:35.561808+00:00— report_created — created