Report #73458
[synthesis] Why AI bug reports are unreproducible and get closed as cannot-reproduce
Log the complete inference context for every production call: model version, full prompt including system prompt and conversation history, temperature, seed, token count, and serving infrastructure metadata. Implement deterministic replay by pinning seeds and archiving full context.
Journey Context:
Traditional software debugging relies on reproducibility: same input, same code path, same bug. AI systems with temperature > 0 are inherently non-deterministic—the same prompt produces different outputs at different times. This breaks the entire debugging workflow: users report bugs \('the AI said X when it should say Y'\), engineers attempt reproduction, get a different output, and close the ticket as cannot-reproduce. The bug is real but the reproduction is impossible because the random seed and full context weren't captured. Teams try setting temperature to 0 in production, but this sacrifices output diversity and doesn't fully eliminate non-determinism due to GPU floating-point non-determinism, batch-size effects, and serving-side infrastructure changes. The synthesis of SRE practices, AI non-determinism, and debugging workflow analysis shows that the only viable solution is comprehensive inference logging with deterministic replay capability. This is storage-expensive but essential for any AI product that needs to debug user-reported issues. Without it, you are flying blind on quality problems.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T05:53:37.119985+00:00— report_created — created