Report #47696
[synthesis] Why AI model evaluation tests are flaky in CI/CD pipelines
Replace deterministic pass/fail eval gates with statistical evaluation: run the eval suite N times (N ≥ 5) and pass the gate only when the mean score clears the threshold and its confidence interval lies entirely above the failure boundary. Pin temperature, seed, and context configuration so that CI runs are as reproducible as possible, and run a separate stochastic eval in staging to catch edge cases that the pinned configuration hides.
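A minimal sketch of such a gate, assuming each eval run produces one scalar score and that the runs are independent and roughly normally distributed. The names `confidence_interval` and `eval_gate`, the 95% level, and the three-way pass/fail/inconclusive outcome are illustrative choices, not part of the original report:

```python
import math
import statistics
from typing import Sequence, Tuple

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom (N - 1).
T_CRIT_95 = {4: 2.776, 5: 2.571, 6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def confidence_interval(scores: Sequence[float]) -> Tuple[float, float]:
    """Return a 95% confidence interval for the mean of repeated eval scores."""
    n = len(scores)
    if n < 5:
        raise ValueError("need at least 5 eval runs")
    mean = statistics.fmean(scores)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(n)
    # Assumption: for N > 10 fall back to the normal quantile, which is
    # slightly optimistic (too narrow) for moderate N.
    t_crit = T_CRIT_95.get(n - 1, statistics.NormalDist().inv_cdf(0.975))
    return mean - t_crit * sem, mean + t_crit * sem


def eval_gate(scores: Sequence[float], threshold: float) -> str:
    """Classify a batch of eval runs against a failure threshold."""
    low, high = confidence_interval(scores)
    if low > threshold:
        return "pass"          # whole interval is above the boundary
    if high < threshold:
        return "fail"          # whole interval is below the boundary
    return "inconclusive"      # interval crosses the boundary: collect more runs


if __name__ == "__main__":
    runs = [0.82, 0.85, 0.80, 0.84, 0.83]  # hypothetical scores from 5 runs
    print(eval_gate(runs, threshold=0.75))
```

Treating "inconclusive" as its own outcome is one possible design: it lets the pipeline request more runs instead of turning an uncertain result into a red build.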
Journey Context:
Software unit tests are deterministic: same code + same test = same result. AI evaluation suites are stochastic: the same model and prompt can yield different outputs because of sampling, context window state, sensitivity to prompt ordering, and even minor formatting differences. CI/CD pipelines built around green/red gates, which work well for software, therefore become unreliable for AI: tests flake, teams lose trust in the pipeline, and they either ignore failures or disable the tests. The OpenAI evals framework acknowledges this by running evaluations as statistical benchmarks, not unit tests. Guo et al. show that modern neural networks are often miscalibrated, meaning a model's stated confidence is not a reliable estimate of how likely its output is to be correct. The synthesis: pinning sampling settings reduces flakiness but does not eliminate it, and forcing the model to be fully deterministic can reduce capability, so the evaluation paradigm itself has to become statistical. This requires rethinking CI/CD from 'does it pass?' to 'is it probably good enough?'
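To make the flakiness argument concrete, here is a small worked calculation. It assumes each pipeline run of a single-shot gate passes independently with a fixed probability. The function name and the numbers are hypothetical and only illustrate the arithmetic:

```python
def prob_spurious_failure(pass_prob: float, pipeline_runs: int) -> float:
    """Probability that at least one of `pipeline_runs` independent gate runs fails."""
    if not 0.0 <= pass_prob <= 1.0:
        raise ValueError("pass_prob must be in [0, 1]")
    if pipeline_runs < 0:
        raise ValueError("pipeline_runs must be non-negative")
    # P(at least one failure) = 1 - P(every run passes)
    return 1.0 - pass_prob ** pipeline_runs


if __name__ == "__main__":
    # Hypothetical: an acceptable model that clears the gate 95% of the time,
    # evaluated once per pipeline run across 20 pipeline runs.
    print(round(prob_spurious_failure(0.95, 20), 3))
```

Under these assumptions, a model that clears a single-shot gate 95% of the time still produces at least one red build in roughly 64% of 20-run stretches.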
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:32:42.577539+00:00— report_created — created