Report #17130
[research] Agent regression tests flake constantly due to LLM non-determinism
Replace boolean pass/fail assertions with statistical regression thresholds. Run the eval suite N times (e.g., N=5) and assert that the pass rate stays at or above a baseline (e.g., 80%), rather than requiring every single run to succeed.
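A minimal sketch of such a threshold check, assuming the eval can be wrapped in a zero-argument callable that returns True on success. The names `run_eval`, `pass_rate`, and `assert_no_regression` are hypothetical and not part of any particular framework, and treating a rate exactly equal to the baseline as passing is an assumption (with N=5 and an 80% baseline, that means 4 of 5 runs).

```python
from typing import Callable


def pass_rate(run_eval: Callable[[], bool], n_runs: int = 5) -> float:
    """Run the eval n_runs times and return the fraction of passing runs."""
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    passes = sum(1 for _ in range(n_runs) if run_eval())
    return passes / n_runs


def assert_no_regression(
    run_eval: Callable[[], bool],
    n_runs: int = 5,
    baseline: float = 0.80,
) -> float:
    """Fail only when the observed pass rate drops below the baseline.

    Assumption: a rate equal to the baseline counts as passing.
    """
    rate = pass_rate(run_eval, n_runs)
    assert rate >= baseline, (
        f"pass rate {rate:.0%} over {n_runs} runs is below baseline {baseline:.0%}"
    )
    return rate
```

In a CI job, `run_eval` would wrap whatever invokes the agent and grades its output. A single failed run no longer fails the build by itself; only a pass rate under the baseline does.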
Journey Context:
Standard software unit tests assume determinism, but LLM outputs are stochastic. A single failing run may reflect nothing more than sampling variance at a nonzero temperature. Treating evals as deterministic leads to alert fatigue and ignored CI pipelines. Statistical thresholds acknowledge the variance while still catching systemic regressions (e.g., a drop from 85% to 40% pass rate).
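To make the variance argument concrete, the sketch below computes how often a gate of "at least 4 of 5 runs pass" would itself succeed. It assumes each run is an independent trial with the same pass probability, which is a simplification; real runs may be correlated. The function name `gate_pass_probability` is hypothetical.

```python
from math import comb


def gate_pass_probability(p: float, n_runs: int = 5, min_passes: int = 4) -> float:
    """Probability that at least min_passes of n_runs independent runs pass,
    given that each run passes with probability p (binomial upper tail)."""
    return sum(
        comb(n_runs, k) * p**k * (1 - p) ** (n_runs - k)
        for k in range(min_passes, n_runs + 1)
    )


# Healthy agent (85% per-run pass probability) vs. regressed agent (40%).
print(f"{gate_pass_probability(0.85):.4f}")  # 0.8352
print(f"{gate_pass_probability(0.40):.4f}")  # 0.0870
```

Under these assumptions the gate separates the two cases clearly, but not perfectly: a healthy agent would still trip it in roughly one run out of six, and a regressed one would slip through about 9% of the time. Raising N or setting the baseline somewhat below the expected healthy rate reduces both error rates.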
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T04:39:38.521373+00:00— report_created — created