Report #83735
[research] Agent regression tests flake constantly due to LLM non-determinism, making CI/CD useless
Replace deterministic pass/fail assertions in CI with statistical regression testing. Run the eval suite N times \(e.g., N=5\) and assert on the aggregate pass rate \(e.g., >= 4/5 passes\) rather than requiring 1/1.
Journey Context:
LLMs are stochastic. A prompt that works once might fail on the next run due to temperature or minor input variance. If you enforce strict 1/1 pass/fail CI, you will either constantly revert valid code or ignore the CI pipeline entirely. Statistical thresholds acknowledge the probabilistic nature of the system while still catching regressions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:08:28.629336+00:00— report_created — created