Report #84711
[synthesis] Why AI bug reports are irreproducible and how to handle it
Log the full inference context for every AI interaction: complete prompt, system prompt, temperature, model version, and seed (where available). Implement user-facing 'report this output' features that capture this context automatically. Evaluate fixes statistically (did the rate of this failure class decrease?) rather than deterministically (did this specific case get fixed?).
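A minimal sketch of what capturing that context could look like, assuming a simple JSON-lines log. The `InferenceRecord` schema, its field names, and the `report_output` handler are hypothetical illustrations, not part of any particular SDK; adapt them to whatever your model provider actually returns.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class InferenceRecord:
    """Context for one model call (hypothetical schema, not a vendor API)."""
    user_prompt: str
    system_prompt: str
    temperature: float
    model_version: str
    seed: Optional[int] = None  # assumption: not every provider exposes a seed
    output: str = ""
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json_line(self) -> str:
        # One JSON object per line keeps the log easy to append to and scan.
        return json.dumps(asdict(self), ensure_ascii=False)


def report_output(record: InferenceRecord, user_note: str, sink: list) -> None:
    """Hypothetical 'report this output' handler.

    Attaches the user's note to the already-captured context, so the user
    never has to copy the prompt or settings by hand.
    """
    entry = asdict(record)
    entry["user_note"] = user_note
    sink.append(json.dumps(entry, ensure_ascii=False))
```

In use, the application would build an `InferenceRecord` at call time and keep it alongside the response, so that a later click on 'report this output' only needs to pass the stored record and the user's note to `report_output`.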
Journey Context:
Traditional bug lifecycle: user reports bug → developer reproduces bug → developer fixes bug → developer confirms fix. The AI bug lifecycle breaks at step 2: user reports hallucination → developer sends the same prompt → gets a different output → bug is marked 'not reproducible.' Combining bug-tracking methodology with LLM non-determinism shows that the deterministic bug lifecycle does not fit AI systems. Sampling temperature, variations in context, and silent model updates make exact reproduction unreliable and often impossible. The same non-determinism undermines step 4: a single clean rerun does not show that a fix worked. The fix requires a paradigm shift from deterministic bug tracking to statistical quality assurance: you track failure rates over populations of inputs, not individual repro cases. This means your bug tracker needs aggregate metrics, not just individual tickets.
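One way to ask "did the rate of this failure class decrease?" is a one-sided two-proportion z-test over failure counts collected before and after a change. The sketch below is an illustration under stated assumptions, not a prescribed method: the function name `failure_rate_change` is hypothetical, and the normal approximation it relies on is only reasonable when both samples contain a fair number of failures and non-failures.

```python
import math


def failure_rate_change(fail_before, n_before, fail_after, n_after):
    """Compare failure rates of one failure class before and after a fix.

    Returns (rate_before, rate_after, p_value), where p_value is the
    one-sided probability of seeing a drop at least this large if the
    true rate had not changed (normal approximation, pooled variance).
    """
    if n_before <= 0 or n_after <= 0:
        raise ValueError("both samples must be non-empty")

    rate_before = fail_before / n_before
    rate_after = fail_after / n_after

    # Pooled failure rate under the null hypothesis of "no change".
    pooled = (fail_before + fail_after) / (n_before + n_after)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    if std_err == 0:
        # All runs failed or none did: the data cannot show a decrease.
        return rate_before, rate_after, 1.0

    z = (rate_before - rate_after) / std_err
    # Upper-tail probability of the standard normal, P(Z >= z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return rate_before, rate_after, p_value
```

For example, 60 failures in 1000 sampled prompts before a change and 30 in 1000 after would give a small p-value, which supports closing the failure class; identical rates would give a p-value of 0.5, which does not. The counts come from re-running a fixed population of inputs, not from retrying the one reported case.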
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:46:44.671595+00:00— report_created — created