Report #53304
[counterintuitive] Setting temperature to 0 should make LLM outputs deterministic and reproducible
Do not rely on temperature=0 for reproducibility. Where the provider offers one, use a seed parameter (e.g. the OpenAI seed parameter) for more controlled sampling, keeping in mind that seeding is best-effort and does not guarantee identical outputs either. Otherwise, design your pipeline to be robust to non-deterministic outputs. For testing, compare semantic equivalence rather than exact string match.
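As a rough illustration of testing without exact string match, the sketch below compares two model outputs after normalization. The helper names normalize and outputs_equivalent are hypothetical, not part of any library. Normalized-text and parsed-JSON comparison is only a cheap stand-in for true semantic equivalence (which might instead use embeddings or a grader model), so treat it as a starting point under those assumptions.

```python
import json
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.casefold()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def outputs_equivalent(expected: str, actual: str) -> bool:
    """Hypothetical helper: loose comparison of two model outputs.

    If both strings parse as JSON, compare the parsed values, so key order
    and whitespace do not matter. Otherwise compare the normalized text.
    """
    try:
        return json.loads(expected) == json.loads(actual)
    except json.JSONDecodeError:
        return normalize(expected) == normalize(actual)


# Differences in formatting alone no longer fail the check.
same_json = outputs_equivalent('{"a": 1, "b": 2}', '{"b":2,"a":1}')
same_text = outputs_equivalent("The answer is 42.", "the answer  is 42")
different = outputs_equivalent("The answer is 42.", "The answer is 43.")
```

In a test suite, an assertion on outputs_equivalent(expected, actual) would replace an assertion on expected == actual, so that harmless variation between runs does not produce false failures.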
Journey Context:
The widespread belief is that temperature=0 means 'always pick the most likely token', which should be deterministic. Greedy selection itself is deterministic: temperature=0 takes the highest-probability token at each step. The logits it operates on are not. Floating-point addition is not associative, and the order in which GPU kernels accumulate partial results can vary with hardware, batch size, and runtime conditions, so the same prompt can produce logits that differ in the last few bits. When two tokens have near-identical logprobs, these tiny differences can flip the argmax selection, producing divergent outputs from that point forward. OpenAI explicitly documents that temperature=0 does not guarantee identical outputs across requests. This silently breaks test suites, reproducibility guarantees, and any workflow assuming the same prompt always yields the same output. The non-determinism sits at the hardware/infrastructure level, so no prompt or sampling-parameter change removes it; explicit seeding mechanisms, where provided, reduce the variation but remain best-effort.
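The argmax flip described above can be reproduced in miniature on a CPU. The following sketch is a toy model, not how any inference stack actually computes logits: it sums the same three contributions in two different orders, then runs a greedy pick against a rival token whose logit is 0.6. The token names, the values, and the greedy_pick helper are all made up for illustration.

```python
def greedy_pick(logits: dict) -> str:
    """Return the token with the highest logit (first key wins on a tie)."""
    return max(logits, key=logits.get)


# The same three contributions to one token's logit, summed in two orders.
contributions = [0.1, 0.2, 0.3]
left_to_right = (contributions[0] + contributions[1]) + contributions[2]
right_to_left = contributions[0] + (contributions[1] + contributions[2])

# The two sums differ in the last bit, because float addition is not associative.
sums_differ = left_to_right != right_to_left

# A rival token sits at a logit of exactly 0.6.
rival_logit = 0.6
pick_a = greedy_pick({"no": rival_logit, "yes": left_to_right})
pick_b = greedy_pick({"no": rival_logit, "yes": right_to_left})

print(left_to_right, right_to_left, pick_a, pick_b)
```

With IEEE 754 doubles, the first ordering yields a sum slightly above 0.6 and selects "yes", while the second yields exactly 0.6 and the tie goes to "no". In a real model the rounding differences come from parallel reductions over thousands of terms rather than three additions, but the mechanism is the same: identical inputs, a different accumulation order, and a different greedy choice.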
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:57:59.427106+00:00— report_created — created