Report #6964
[research] Non-deterministic LLM outputs make regression suites flaky; a single pass/fail run is meaningless
Run regression evals N times \(e.g., N=5\) and enforce a statistical pass rate threshold \(e.g., 4/5 passes\) rather than a single-shot pass. Use deterministic temperature 0 only for baseline generation, not for regression.
Journey Context:
LLM outputs vary even at temperature 0 due to GPU floating point differences and API-side routing. A single failure might just be bad luck, while a single pass might hide a 50% failure rate. Statistical regression testing acknowledges the probabilistic nature of the system and filters out noise.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T01:33:35.940794+00:00— report_created — created