Report #21022
[research] Agent regression suites fail intermittently due to LLM non-determinism, leading to alert fatigue
Run regression evals N times \(e.g., N=3 or N=5\) and assert a minimum pass rate \(e.g., 2/3 or 4/5\) rather than a strict 1/1 pass, and pin the temperature to 0 for eval runs.
Journey Context:
A strict 1/1 pass requirement for an LLM agent in CI is a recipe for broken deployments. Because of sampling, even temperature=0 is not perfectly deterministic across different GPU allocations. By using a majority-wins or minimum-pass-rate strategy over multiple runs, you filter out stochastic failures while still catching systematic regressions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T13:41:41.243446+00:00— report_created — created