Report #21127
[gotcha] AI feature passes manual testing but fails stochastically in production due to non-determinism
Never validate AI features with a single generation. Run 10-50 generations per test case and evaluate the output distribution. Implement automated evaluation pipelines (LLM-as-judge or heuristic checks) that run on every prompt change. Track output quality metrics (success rate, refusal rate, format compliance) over time, not just binary pass/fail. Set explicit acceptable thresholds (e.g., 'format compliance must be >95% across 100 runs').
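A minimal sketch of such a distribution-based check, in Python. Everything named here is hypothetical and chosen for illustration: `make_fake_model` stands in for a real LLM call so the example runs offline, the format check assumes the feature must return a JSON object with an `answer` key, and the refusal check is a crude prefix heuristic. Swap in your own client and checks.

```python
import json
import random
from typing import Callable, Dict

THRESHOLD = 0.95  # from the guidance above: format compliance must be >95%


def is_valid_format(output: str) -> bool:
    """Heuristic check (assumed format): a JSON object with an 'answer' key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and "answer" in parsed


def is_refusal(output: str) -> bool:
    """Crude refusal heuristic based on common opening phrases."""
    return output.lower().startswith(("i can't", "i cannot", "sorry"))


def evaluate_prompt(
    generate: Callable[[str], str], prompt: str, runs: int = 100
) -> Dict[str, float]:
    """Call the model `runs` times and summarize the output distribution."""
    outputs = [generate(prompt) for _ in range(runs)]
    return {
        "format_compliance": sum(is_valid_format(o) for o in outputs) / runs,
        "refusal_rate": sum(is_refusal(o) for o in outputs) / runs,
    }


def make_fake_model(failure_rate: float, seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM call; fails at roughly `failure_rate`."""
    rng = random.Random(seed)

    def generate(prompt: str) -> str:
        roll = rng.random()
        if roll < failure_rate / 2:
            return "Sorry, I can't help with that."  # simulated refusal
        if roll < failure_rate:
            return "answer: 42"  # simulated malformed (non-JSON) output
        return json.dumps({"answer": "42"})

    return generate


if __name__ == "__main__":
    metrics = evaluate_prompt(make_fake_model(0.2), "What is 6 x 7?", runs=100)
    print(metrics)
    print("PASS" if metrics["format_compliance"] > THRESHOLD else "FAIL")
```

Running a check like this on every prompt change, and logging the returned metrics rather than only the PASS/FAIL line, gives the over-time tracking described above.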
Journey Context:
LLM outputs are non-deterministic: even at temperature 0, some APIs do not guarantee identical outputs across runs. A feature that works perfectly in your manual test might fail 15-30% of the time in production. Traditional software testing assumes deterministic outputs: if it passes once, it always passes. AI features break this assumption. The gotcha: developers test with one or two generations during development, see that it works, ship it, and then get flooded with intermittent bug reports they can't reproduce (because re-running the same prompt yields a different, working response), which makes debugging extremely frustrating. OpenAI's seed parameter helps with reproducibility but doesn't guarantee it, and many developers don't know about it. The fundamental shift: move from 'does it work?' to 'what fraction of the time does it work?' and set explicit quality thresholds. This requires building evaluation infrastructure (not just test cases) before shipping AI features.
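A rough back-of-the-envelope calculation shows why one or two manual generations are so misleading. The sketch below assumes each generation fails independently with a fixed probability, which is a simplification; the function names are illustrative only.

```python
import math


def prob_all_pass(failure_rate: float, checks: int) -> float:
    """Chance that `checks` independent generations all succeed."""
    return (1.0 - failure_rate) ** checks


def runs_to_see_failure(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of runs with at least `confidence` chance of >=1 failure."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    for p in (0.15, 0.30):
        print(
            f"failure rate {p:.0%}: "
            f"two manual checks both pass with probability {prob_all_pass(p, 2):.0%}; "
            f"{runs_to_see_failure(p)} runs needed to see a failure (95% confidence)"
        )
```

Under these assumptions, a feature that fails 15% of the time still passes two manual checks about 72% of the time, and even at a 30% failure rate it passes both about 49% of the time. Seeing at least one failure with 95% confidence takes 19 runs at 15% and 9 runs at 30%, which is consistent with running tens of generations per test case rather than one or two.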
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T13:52:36.238912+00:00— report_created — created