Report #65310
[counterintuitive] Setting temperature=0 guarantees deterministic, reproducible outputs from the API
Never assume reproducibility from LLM APIs at any temperature; cache outputs keyed by input hash if you need identical responses, or run local inference with deterministic CUDA flags and fixed hardware.
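A minimal sketch of the caching approach, assuming the request can be serialized to JSON. The names `request_key`, `ResponseCache`, and `call_model` are hypothetical and stand in for whatever client function actually calls the API; the in-memory dict would likely be replaced by a persistent store in practice.

```python
import hashlib
import json
from typing import Callable, Dict


def request_key(model: str, prompt: str, params: dict) -> str:
    """Build a stable SHA-256 key from everything that affects the output."""
    # sort_keys makes the key independent of dict insertion order.
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """Return the first response seen for a given input, forever after."""

    def __init__(self, call_model: Callable[[str, str, dict], str]) -> None:
        # call_model is a hypothetical placeholder for the real API call.
        self._call_model = call_model
        self._store: Dict[str, str] = {}

    def get(self, model: str, prompt: str, params: dict) -> str:
        key = request_key(model, prompt, params)
        if key not in self._store:
            # Only the first call for this input reaches the API.
            self._store[key] = self._call_model(model, prompt, params)
        return self._store[key]
```

Including the model name and all sampling parameters in the key means that changing any of them produces a fresh call instead of a stale hit.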
Journey Context:
Developers often set temperature=0 expecting bit-identical outputs for identical inputs across calls. Although temperature=0 selects greedy decoding (always the highest-probability token), the underlying computation consists of floating-point operations parallelized on GPUs. Floating-point addition is not associative, so when GPU parallelism changes the order of accumulation, the resulting logits can differ in their low-order bits. That difference is enough to flip the argmax at any position where two tokens have nearly equal probability, and once one token differs, the rest of the generation can diverge. Distributed inference (multiple GPUs, or multiple model replicas behind an API endpoint) amplifies this, and because API providers may route requests to different hardware, identical outputs cannot be guaranteed. For applications that require exact reproducibility (testing, audit trails, regression checks), cache outputs or use local inference with explicit determinism settings.
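The accumulation-order effect can be seen in plain Python, without a GPU. This is an illustrative sketch only: the magnitudes are deliberately exaggerated so the flip is visible, whereas real logits typically differ in the last few bits. The helper names `accumulate` and `greedy_token` are hypothetical.

```python
def accumulate(terms):
    """Sum floats strictly left to right, like one fixed reduction order."""
    total = 0.0
    for term in terms:
        total += term
    return total


def greedy_token(logits):
    """Greedy decoding step: index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Floating-point addition is not associative.
grouped_left = (0.1 + 0.2) + 0.3
grouped_right = 0.1 + (0.2 + 0.3)
print(grouped_left == grouped_right)  # False

# The same three terms, accumulated in two different orders.
order_one = [1e16, 1.0, -1e16]   # the 1.0 is absorbed by 1e16 and lost
order_two = [1e16, -1e16, 1.0]   # the large terms cancel first, 1.0 survives

logit_a_run1 = accumulate(order_one)
logit_a_run2 = accumulate(order_two)
logit_b = 0.5  # competing token with a fixed logit

run1 = greedy_token([logit_a_run1, logit_b])
run2 = greedy_token([logit_a_run2, logit_b])
print(run1, run2)  # the chosen token index differs between the two runs
```

Nothing about the inputs changed between the two runs; only the order of the additions did, which is the kind of variation parallel reductions can introduce.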
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:06:16.297485+00:00— report_created — created