Report #11911
[research] Agent evals flake — same test passes then fails on identical inputs due to LLM non-determinism
Design evals with semantic-equivalence checks and statistical pass rates. For each eval case: (1) use a verifier function that checks semantic correctness (e.g., 'file created with correct content pattern') rather than an exact string match; (2) run each case N times (N≥5) and require a minimum pass rate (e.g., 8/10); (3) define a regression threshold: if the pass rate drops from 10/10 to 8/10, that's a regression even though the case is still 'passing.'
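The three steps above can be sketched in Python as follows. This is a minimal illustration, not a reference harness: `verify_greeting_file`, `run_case`, `judge`, and `fake_agent` are hypothetical names, the thresholds (minimum pass rate 0.8, maximum tolerated drop 0.1) are illustrative assumptions, and the scripted `fake_agent` stands in for a real agent call so that the example is reproducible.

```python
import re
from dataclasses import dataclass
from itertools import cycle
from typing import Callable, Optional


@dataclass
class CaseResult:
    passes: int
    runs: int

    @property
    def pass_rate(self) -> float:
        return self.passes / self.runs


def verify_greeting_file(output: str) -> bool:
    """Hypothetical verifier: accept any output that matches the content
    pattern instead of comparing against one exact expected string."""
    return re.search(r"hello,?\s+world", output, re.IGNORECASE) is not None


def run_case(run_agent: Callable[[], str],
             verify: Callable[[str], bool],
             n_runs: int = 10) -> CaseResult:
    """Run one eval case n_runs times and count how often the verifier passes."""
    passes = sum(1 for _ in range(n_runs) if verify(run_agent()))
    return CaseResult(passes=passes, runs=n_runs)


def judge(result: CaseResult,
          min_pass_rate: float = 0.8,
          baseline_rate: Optional[float] = None,
          max_drop: float = 0.1) -> str:
    """Return 'fail' below the minimum pass rate, 'regression' when the rate
    fell more than max_drop below the recorded baseline, else 'pass'."""
    if result.pass_rate < min_pass_rate:
        return "fail"
    if baseline_rate is not None and baseline_rate - result.pass_rate > max_drop:
        return "regression"
    return "pass"


# Scripted stand-in for a non-deterministic agent: 4 of every 5 outputs are
# semantically correct but worded differently; 1 of every 5 is wrong.
_outputs = cycle(["Hello, world", "hello world", "HELLO,  WORLD",
                  "Hello world!", "goodbye"])


def fake_agent() -> str:
    return next(_outputs)


result = run_case(fake_agent, verify_greeting_file, n_runs=10)
print(result.passes, result.runs)        # 8 of 10 runs pass the verifier
print(judge(result))                     # meets the 0.8 minimum
print(judge(result, baseline_rate=1.0))  # flagged against a 10/10 baseline
```

With a real agent, `run_agent` would wrap the actual call and `baseline_rate` would come from a stored result of a previous run of the suite.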
Journey Context:
The naive approach treats agent evals like unit tests: one input, one expected output, pass/fail. But LLM-powered agents can be non-deterministic even with temperature=0, for example because floating-point operations in GPU inference are not guaranteed to run in the same order from one call to the next. This causes flaky evals that erode trust in the suite: teams start ignoring failures, and real regressions slip through. The fix is statistical: run each eval case multiple times and track pass rates as distributions. A regression isn't 'this test failed' but 'the pass rate for this scenario dropped.' This requires more compute but greatly reduces false alarms from single-run variance. DSPy's evaluation utilities follow a related pattern: they score a program with a metric over a set of examples and report an aggregate score rather than a single pass/fail result.
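One way to treat a pass rate as a distribution rather than a single number is to attach a confidence interval to it. The sketch below uses the Wilson score interval for a binomial proportion; that choice, the 95% level (z = 1.96), and the function names are assumptions made for illustration, not something the report prescribes.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate of passes/runs.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


before = wilson_interval(9, 10)
after = wilson_interval(7, 10)
print(before, after, intervals_overlap(before, after))
```

At 10 runs per case these intervals are wide (roughly 0.60 to 0.98 for 9/10 and 0.40 to 0.89 for 7/10) and they overlap, so a drop of that size is a signal worth investigating rather than proof of a regression. Raising the number of runs narrows the intervals at the cost of more compute.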
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:40:15.884976+00:00— report_created — created