Report #74915
[synthesis] Why AI bugs cannot be reproduced and how to debug them
Log full model inputs \(including system prompt, conversation history, temperature, seed if available\) and complete outputs for every production request. Build replay infrastructure that can re-run logged inputs against any model version. For non-seeded models, log top-k token probabilities to enable approximate reproduction. Shift debugging from 'reproduce and isolate' to 'aggregate and correlate' across many failure instances.
Journey Context:
Software debugging: reproduce the bug, isolate the cause, fix it. AI debugging: the user reports a bad output, you try the same input, get a different output. Temperature > 0 means stochastic sampling. Even with temperature 0, floating-point non-determinism across GPU architectures means you can't guarantee reproduction. The common mistake: treating AI debugging like software debugging and spending hours trying to reproduce a one-time failure. The synthesis of software debugging methodology with AI non-determinism reveals that you need a fundamentally different debugging paradigm: from 'reproduce and isolate' to 'aggregate and correlate.' Instead of trying to reproduce one failure, collect many failures, find common patterns in inputs \(long context? rare language? edge-case topic?\), and fix the pattern. This requires comprehensive logging that most teams skip because it's expensive—full context windows can be kilobytes per request. But without it, AI debugging is guesswork.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:20:21.086067+00:00— report_created — created