Report #31508
[research] Agent regression suites flake constantly due to LLM non-determinism in text generation
Decouple the eval of the agent's decision-making from the eval of the environment's execution: mock all tool outputs, set temperature to 0 (which reduces sampling variance but may not remove it entirely on hosted models), and assert against exact tool call sequences rather than free-text reasoning.
Journey Context:
Evaluating final text outputs or agent reasoning traces is a losing battle, because sampling temperature and model updates change the wording from run to run. The core logic of an agent is which tools it calls, and in what order. By mocking the environment (so the agent always receives the exact same state) and asserting on the sequence of tool calls, you turn a non-deterministic text generation problem into a deterministic state-machine testing problem.
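A minimal sketch of the approach described above, in Python. Every name here is hypothetical and not taken from any particular agent framework: `MockEnvironment`, `run_agent`, `stub_decide`, and the tools `search_orders` and `issue_refund` are illustrations only. `stub_decide` stands in for the model call; in a real suite that step would call the model at temperature 0.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

ToolCall = Tuple[str, Dict[str, Any]]   # (tool name, arguments)
Step = Tuple[str, Any]                  # (tool name, tool result)


@dataclass
class MockEnvironment:
    """Returns canned tool outputs and records every call the agent makes."""
    canned: Dict[str, Any]
    calls: List[ToolCall] = field(default_factory=list)

    def call(self, name: str, **kwargs: Any) -> Any:
        self.calls.append((name, kwargs))
        return self.canned[name]  # same state on every run


def stub_decide(history: List[Step]) -> Optional[ToolCall]:
    """Hypothetical stand-in for the model's decision step."""
    if not history:
        return ("search_orders", {"customer_id": "c-42"})
    last_name, last_result = history[-1]
    if last_name == "search_orders":
        return ("issue_refund", {"order_id": last_result["order_id"]})
    return None  # nothing left to do


def run_agent(
    decide: Callable[[List[Step]], Optional[ToolCall]],
    env: MockEnvironment,
    max_steps: int = 10,
) -> List[ToolCall]:
    """Loop: ask for the next tool call, execute it against the mock, repeat."""
    history: List[Step] = []
    for _ in range(max_steps):
        action = decide(history)
        if action is None:
            break
        name, args = action
        history.append((name, env.call(name, **args)))
    return env.calls


env = MockEnvironment(
    canned={
        "search_orders": {"order_id": "o-7"},
        "issue_refund": {"status": "ok"},
    }
)
calls = run_agent(stub_decide, env)

# Assert on the exact tool call sequence, not on any generated text.
expected = [
    ("search_orders", {"customer_id": "c-42"}),
    ("issue_refund", {"order_id": "o-7"}),
]
assert calls == expected
```

Because the mock returns the same outputs on every run, a failing assertion points to a change in the agent's decisions rather than to environment noise.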
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T07:16:24.199719+00:00— report_created — created