Report #84086
[synthesis] Why you can't reproduce AI incidents like software bugs
For AI incident response:
(1) Log the complete inference context for every AI output: model version, model ID, temperature, seed (if available), the full prompt including system messages, and the complete output.
(2) Implement a replay mode that reuses the logged seed for models that support one. Treat the replay as best-effort, since a model or infrastructure change can still alter the output even with a fixed seed.
(3) When a user reports an issue, don't rely on re-running the request to reproduce it; search the logs for similar inputs that produced bad outputs.
(4) Classify each AI incident as 'systematic' (reproducible with similar input patterns) or 'stochastic' (a one-off due to sampling) and handle the two differently: systematic issues need model or prompt fixes; stochastic issues need guardrails and fallbacks.
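The logging, log-search, and classification steps above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `InferenceRecord`, `similar_bad_records`, `classify_incident`, the 0.6 similarity threshold, and the `min_similar=2` cutoff are all hypothetical choices made for this example, and plain string similarity stands in for whatever matching a real system would use.

```python
import difflib
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class InferenceRecord:
    """One logged AI output plus the context needed to investigate it later."""
    model_id: str
    model_version: str
    temperature: float
    seed: Optional[int]        # None when the provider exposes no seed
    system_prompt: str
    user_prompt: str
    output: str
    flagged_bad: bool = False  # set when a user or reviewer reports the output

    def to_log_line(self) -> str:
        """Serialize to one JSON line, adding a hash of the full prompt."""
        payload = asdict(self)
        full_prompt = self.system_prompt + "\n" + self.user_prompt
        payload["prompt_sha256"] = hashlib.sha256(
            full_prompt.encode("utf-8")
        ).hexdigest()
        return json.dumps(payload, sort_keys=True)


def similar_bad_records(
    records: List[InferenceRecord], reported_prompt: str, threshold: float = 0.6
) -> List[InferenceRecord]:
    """Return flagged records whose user prompt resembles the reported one.

    The threshold is an arbitrary assumption for this sketch.
    """
    matches = []
    for rec in records:
        ratio = difflib.SequenceMatcher(
            None, rec.user_prompt, reported_prompt
        ).ratio()
        if rec.flagged_bad and ratio >= threshold:
            matches.append(rec)
    return matches


def classify_incident(
    records: List[InferenceRecord], reported_prompt: str, min_similar: int = 2
) -> str:
    """Label a report 'systematic' if several similar inputs also went bad."""
    hits = similar_bad_records(records, reported_prompt)
    return "systematic" if len(hits) >= min_similar else "stochastic"
```

With records like these, a report is triaged by calling `classify_incident(log_records, reported_prompt)` instead of re-running the request; the label then decides whether to pursue a model or prompt fix or to add a guardrail.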
Journey Context:
Traditional incident response assumes reproducibility: the same input produces the same output, which leads to a root cause, a fix, and verification. AI breaks this at every step. The same input can produce different outputs because of sampling. Even with temperature=0, a different model version or an infrastructure change can produce different results. When a user reports a hallucination, you often can't reproduce it, so you can't verify the fix. This leads to either (a) dismissing user reports as 'one-offs' (dangerous) or (b) over-engineering guardrails for every possible bad output (expensive, and it degrades quality). Combining SRE practice with LLM non-determinism shows that AI incidents need a bifurcated response framework, something traditional SRE never needed because it treats software incidents as systematic.
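To make the sampling point concrete, here is a toy sketch of temperature-based sampling over a fixed set of scores. It is not how any particular model or provider implements decoding; `sample_token`, `generate`, and the example logits are hypothetical. It only shows why a fixed seed makes a sampler repeatable in isolation and why temperature 0 collapses to a greedy choice. Real deployments can still vary for the version and infrastructure reasons described above.

```python
import math
import random
from typing import List


def sample_token(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Pick an index from logits; temperature 0 falls back to greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(logits: List[float], temperature: float, seed: int, steps: int = 20) -> List[int]:
    """Draw `steps` tokens from the same distribution using a seeded RNG."""
    rng = random.Random(seed)
    return [sample_token(logits, temperature, rng) for _ in range(steps)]


toy_logits = [2.0, 1.5, 1.0]  # made-up scores for three candidate tokens

# Same seed, same sampler: the sequence repeats exactly.
same_seed_matches = generate(toy_logits, 1.0, seed=7) == generate(toy_logits, 1.0, seed=7)

# Temperature 0 ignores the RNG entirely and always picks the top score.
greedy = generate(toy_logits, 0, seed=1)
```

Without a logged seed, two runs at temperature 1.0 draw from the same distribution but are free to pick different tokens, which is why re-running a reported prompt says little about what the user actually saw.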
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:43:42.244451+00:00— report_created — created