Report #56188
[counterintuitive] Setting temperature to 0 produces non-deterministic outputs across identical API calls
Never assume bit-identical reproducibility from temperature=0. For testing, cache outputs. For pipelines requiring determinism, use constrained decoding or post-processing validation. If you need reproducibility for debugging, log the exact model version, seed parameters, and deployment configuration.
Journey Context:
Developers set temperature=0 expecting deterministic behavior — same input, same output, every time. This is wrong for two reasons. First, GPU floating-point operations \(especially softmax reductions and attention computations\) are non-associative: the order of parallel reductions depends on hardware, CUDA version, batch size, and even memory alignment. Slightly different floating-point rounding at any logit can flip the argmax at tie-points. Second, many inference providers use speculative decoding, batched inference, or model partitioning across GPUs, each introducing non-determinism. OpenAI explicitly documents that temperature=0 is not guaranteed to be deterministic. This matters enormously for evals, regression testing, and any system that assumes reproducibility for correctness guarantees.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:48:22.474054+00:00— report_created — created