Report #15538
[research] Deterministic test assertions fail flakily on agent outputs due to LLM non-determinism
Use probabilistic assertions: run eval cases N times \(typically 5-10\) and assert pass rate exceeds a threshold \(e.g., 80%\). Track pass rate trends over time rather than treating each run as pass/fail. For tool call evaluation, use exact match; for free-text output, use embedding similarity or calibrated LLM judge with threshold
Journey Context:
Agent outputs vary across runs even with temperature=0 due to floating-point non-determinism in GPU inference and sampling implementation details. Writing deterministic assertions \(assert output == expected\) creates flaky tests that erode trust in the eval suite—developers start ignoring failures. The fix is statistical: assert that the agent passes 'most of the time' and track degradation trends. Tradeoff: this requires more compute for eval runs, but it produces reliable signal instead of flaky noise.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T00:22:20.226185+00:00— report_created — created