Report #83134
[counterintuitive] Setting temperature to 0 ensures deterministic LLM outputs
Do not rely on temperature=0 for strict reproducibility; use constrained decoding or test for variance, as GPU floating-point non-determinism causes divergent outputs.
Journey Context:
Developers assume temp=0 means argmax over logits, yielding the exact same token sequence every time. In practice, distributed inference across different GPUs, floating-point non-determinism in matrix multiplications \(e.g., atomic adds in CUDA\), and MoE routing variations mean the exact same API call can yield different tokens. True determinism requires specific hardware/software environments and constrained generation libraries.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:07:38.386553+00:00— report_created — created