Report #83043
[counterintuitive] Why do temperature-0 API calls produce different outputs on repeated identical requests?
Never assume temperature=0 gives deterministic or reproducible outputs. For reproducibility, use OpenAI's seed parameter, store and replay outputs, or run local models with deterministic inference settings (fixed seed, single GPU, torch.use_deterministic_algorithms). Test reproducibility explicitly before relying on it.
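The "store and replay" and "test explicitly" advice can be sketched as follows. This is a minimal illustration, not a production cache: `ReplayCache`, `request_key`, `count_distinct_outputs`, and `make_flaky_backend` are hypothetical names introduced here, the backend is a stand-in that deliberately varies its answer, and the store is in-memory only (a real setup would presumably persist it to disk or a database).

```python
import hashlib
import json
from collections import Counter
from typing import Callable, Dict

Request = Dict[str, object]
ModelCall = Callable[[Request], str]


def request_key(request: Request) -> str:
    """Hash a request into a stable key, independent of dict key order."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ReplayCache:
    """Call the model once per distinct request, then replay the stored output."""

    def __init__(self, call_model: ModelCall) -> None:
        self._call_model = call_model
        self._store: Dict[str, str] = {}

    def complete(self, request: Request) -> str:
        key = request_key(request)
        if key not in self._store:
            self._store[key] = self._call_model(request)
        return self._store[key]


def count_distinct_outputs(call_model: ModelCall, request: Request, runs: int = 5) -> Counter:
    """Send the same request several times and tally the outputs seen."""
    return Counter(call_model(request) for _ in range(runs))


def make_flaky_backend() -> ModelCall:
    """Hypothetical stand-in for a remote model that is not fully reproducible."""
    calls = {"n": 0}

    def call_model(request: Request) -> str:
        calls["n"] += 1
        # Every third call returns a different answer, mimicking a flipped token.
        return "answer A" if calls["n"] % 3 else "answer B"

    return call_model


if __name__ == "__main__":
    request = {"model": "example-model", "temperature": 0, "prompt": "Hello"}

    # Explicit reproducibility test: more than one key means outputs diverged.
    print(count_distinct_outputs(make_flaky_backend(), request, runs=6))

    # Replay: after the first call, the stored output is returned every time.
    cache = ReplayCache(make_flaky_backend())
    first = cache.complete(request)
    print(all(cache.complete(request) == first for _ in range(10)))
```

Replaying stored outputs sidesteps the question of whether the backend is deterministic at all, which is why it is the only option here that does not depend on the provider or hardware.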
Journey Context:
Developers widely assume temperature=0 means 'always pick the most likely token, therefore identical output every time.' This is wrong for multiple independent reasons: (1) Floating-point addition is non-associative, so the order in which a GPU performs parallel reductions (in matrix multiplications, softmax, and similar operations) can produce slightly different logits and probabilities across runs, changing the argmax winner when two tokens are nearly tied. (2) Batch size differences, padding, and distributed inference across GPUs change which kernels run and how the work is partitioned, and therefore the order of accumulation. (3) FlashAttention and other fused kernels may use different reduction orders. (4) Some inference frameworks apply top-k or top-p filtering even at temperature 0, and floating-point imprecision can flip which tokens survive filtering. OpenAI's API documentation explicitly acknowledges this and provides a seed parameter, but even seeded calls are described as 'mostly deterministic': they match on most but not 100% of calls. The correct mental model: temperature=0 removes sampling randomness but not computational non-determinism. These are two independent sources of variance, and only the first is controlled by temperature.
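Reason (1) can be reproduced on a CPU with plain Python floats. The sketch below is a toy illustration, not a model of any real kernel: the three "contributions" and the rival logit are made-up values chosen so that the two summation orders land on opposite sides of a near tie, and the function names are hypothetical.

```python
from typing import Sequence


def sum_left_to_right(values: Sequence[float]) -> float:
    """Accumulate in the given order."""
    total = 0.0
    for value in values:
        total += value
    return total


def sum_right_to_left(values: Sequence[float]) -> float:
    """Accumulate the same values in reverse order."""
    total = 0.0
    for value in reversed(values):
        total += value
    return total


def greedy_pick(logits: Sequence[float]) -> int:
    """Temperature-0 style choice: index of the largest logit (first index wins ties)."""
    return max(range(len(logits)), key=lambda i: logits[i])


if __name__ == "__main__":
    contributions = [0.1, 0.2, 0.3]  # toy partial sums feeding one token's logit
    rival_logit = 0.6                # toy logit of a competing token

    logit_run_a = sum_left_to_right(contributions)
    logit_run_b = sum_right_to_left(contributions)

    print(logit_run_a == logit_run_b)               # False
    print(greedy_pick([rival_logit, logit_run_a]))  # 1
    print(greedy_pick([rival_logit, logit_run_b]))  # 0
```

The inputs are identical in both runs and no sampling is involved. Only the accumulation order changes, and that alone is enough to change which token the greedy step selects.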
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:58:35.950183+00:00— report_created — created