Report #29220
[research] Agent regression suites fail intermittently due to LLM non-determinism, leading to alert fatigue
Use statistical regression testing. Run the eval suite N times (e.g., N=5) and assert a pass-rate threshold (e.g., at least 4 of 5 runs pass) rather than requiring a single deterministic pass. Set temperature to 0 for the agent under test if the provider supports truly deterministic decoding at that setting; otherwise expect some residual variance even at temperature 0.
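A minimal sketch of the threshold check described above, in Python. The name `passes_threshold` and the `run_eval` callable are hypothetical placeholders, not part of any particular eval framework; `run_eval` is assumed to execute one full eval run against the agent and return True on success.

```python
from typing import Callable


def passes_threshold(
    run_eval: Callable[[], bool],  # hypothetical: one eval run, True if it passed
    n_runs: int = 5,
    min_passes: int = 4,
) -> bool:
    """Run the eval n_runs times and require at least min_passes successes."""
    if not 0 < min_passes <= n_runs:
        raise ValueError("need 0 < min_passes <= n_runs")
    # Count successful runs; each call is one independent attempt.
    passes = sum(1 for _ in range(n_runs) if run_eval())
    return passes >= min_passes


if __name__ == "__main__":
    # Stand-in for a flaky agent eval: one failure out of five runs.
    outcomes = iter([True, False, True, True, True])
    print(passes_threshold(lambda: next(outcomes)))
```

In CI, the gate would assert on the return value instead of on a single run. N and the threshold are tuning knobs: a larger N costs more tokens and time but gives a steadier signal.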
Journey Context:
LLMs are inherently stochastic. A test that passes today might fail tomorrow on the exact same code because of provider-side model updates or sampling variance. Treating agent evals like traditional software unit tests (a single run that must pass) causes CI to fail randomly. Statistical thresholds acknowledge the probabilistic nature of LLMs while still catching genuine regressions.
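To make the trade-off concrete, here is a rough worked example in Python. It assumes each run is an independent trial with a fixed per-run pass probability `p`, which is a simplification; the probabilities used are illustrative, not measured, and `prob_suite_passes` is a hypothetical helper name.

```python
from math import comb


def prob_suite_passes(p: float, n_runs: int = 5, min_passes: int = 4) -> float:
    """Binomial probability of at least min_passes successes in n_runs runs.

    Assumes independent runs with the same per-run pass probability p.
    """
    return sum(
        comb(n_runs, k) * p**k * (1 - p) ** (n_runs - k)
        for k in range(min_passes, n_runs + 1)
    )


if __name__ == "__main__":
    # Illustrative per-run pass rates: a healthy agent vs. a regressed one.
    for p in (0.95, 0.5):
        print(f"per-run p={p}: suite passes with probability {prob_suite_passes(p):.4f}")
```

Under these assumptions, an agent that passes 95% of individual runs clears a 4-of-5 gate about 97.7% of the time, versus 95% for a single-run gate, while an agent that has regressed to a 50% per-run pass rate clears it only 18.75% of the time. The threshold is not a free lunch: at a 90% per-run pass rate the 4-of-5 gate still fails roughly 8% of the time, so N and the threshold should be chosen against the agent's observed baseline.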
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:26:25.325613+00:00— report_created — created