Report #71609
[counterintuitive] Why are temperature=0 outputs not identical across repeated API calls?
Do not assume temperature=0 guarantees deterministic output. If you need exact reproducibility, cache responses or implement idempotency at the application layer. A seed parameter, where the API offers one, can narrow the variation, but given the causes described under Journey Context it may not remove it.
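The caching option can be sketched as follows. This is a minimal in-memory example, not a production design: `ResponseCache`, `request_key`, and `make_flaky_model` are hypothetical names, and `call_model` stands in for whatever function actually calls your provider. The request is hashed in a canonical form so that the same parameters always map to the same stored response.

```python
import hashlib
import itertools
import json
from typing import Callable, Dict


def request_key(request: dict) -> str:
    """Hash a request in canonical form so key order does not matter."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ResponseCache:
    """Return the first response seen for a request on every later call."""

    def __init__(self, call_model: Callable[[dict], str]) -> None:
        # call_model is an assumption: any callable that sends the
        # request to a model and returns its text output.
        self._call_model = call_model
        self._store: Dict[str, str] = {}

    def complete(self, request: dict) -> str:
        key = request_key(request)
        if key not in self._store:
            self._store[key] = self._call_model(request)
        return self._store[key]


def make_flaky_model() -> Callable[[dict], str]:
    """Stand-in for a real API: returns a different string on every call."""
    counter = itertools.count()
    return lambda request: f"answer-{next(counter)}"
```

Wrapping the model call this way makes repeated identical requests return identical text regardless of what the backend does. A real deployment would likely need persistent storage and an eviction policy, and the model name and version should be part of the hashed request.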
Journey Context:
The widespread belief is that temperature=0 means greedy decoding (always picking the highest-probability token), which should be deterministic. In practice, non-determinism arises from several sources: (1) floating-point addition is not associative, so GPU parallel reductions (in matrix multiplications, attention, and softmax) that accumulate in a different order can produce slightly different logits for the same input; when two candidate tokens are nearly tied, that difference can flip the argmax, and because generation is autoregressive, one flipped token changes everything after it; (2) batched versus single-request inference changes the computation path; (3) distributed inference may run the same request on different GPU architectures or nodes, each with its own kernels and reduction order; (4) some providers may apply implicit top-k or nucleus sampling even at temperature=0. The API contract for temperature=0 is best read as promising no intentional stochastic sampling, not bit-identical computation. Sources (1) to (3) are hardware and systems-level constraints, not model behavior issues.
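Source (1) can be illustrated with a small sketch in plain Python. It uses CPU double-precision floats rather than GPU kernels, and the values are chosen for illustration, but the mechanism is the same: changing only the order of addition changes the result in the last bit, and that is enough to change which token a greedy pick selects. `greedy_pick` and the variable names are hypothetical.

```python
def greedy_pick(logits):
    """Index of the largest logit; ties go to the lowest index."""
    return max(range(len(logits)), key=lambda i: logits[i])


# The same three contributions to one token's logit, summed in two orders.
contributions = [0.1, 0.2, 0.3]
left_to_right = (contributions[0] + contributions[1]) + contributions[2]
right_to_left = contributions[0] + (contributions[1] + contributions[2])

print(left_to_right)  # 0.6000000000000001
print(right_to_left)  # 0.6

# A rival token whose logit is exactly 0.6 sits at index 0.
rival_logit = 0.6
pick_a = greedy_pick([rival_logit, left_to_right])
pick_b = greedy_pick([rival_logit, right_to_left])

print(pick_a)  # 1: the summed logit edges ahead of the rival
print(pick_b)  # 0: exact tie, so the rival wins
```

Nothing random happens in this code, yet the selected index differs between the two summation orders. In a real model the near-ties are rarer, but a long generation offers many chances for one to occur.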
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:46:38.027967+00:00— report_created — created