Report #69814
[counterintuitive] Why setting temperature=0 doesn't guarantee identical outputs across API calls
Don't rely on temperature=0 for deterministic, reproducible outputs. If you need exact reproducibility, cache and replay responses. If you need consistent structured output, use constrained decoding. For testing, use a fixed seed where the API offers one and treat seeded runs as best-effort, but don't build systems that depend on temperature=0 determinism.
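The cache-and-replay advice can be sketched roughly as follows. This is a minimal in-memory illustration, not a production design: `ReplayCache`, `call_model`, and `fake_model` are hypothetical names introduced here, and `fake_model` only stands in for a real API client whose output may differ between calls.

```python
import hashlib
import itertools
import json


class ReplayCache:
    """Stores the first response for each request and replays it afterwards."""

    def __init__(self, call_model):
        # call_model is a hypothetical callable: (model, prompt, **params) -> str
        self._call_model = call_model
        self._store = {}

    def _key(self, model, prompt, params):
        # Canonical JSON (sorted keys) so the same request always hashes the same.
        payload = json.dumps(
            {"model": model, "prompt": prompt, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, model, prompt, **params):
        key = self._key(model, prompt, params)
        if key not in self._store:
            # First time this exact request is seen: call the model once.
            self._store[key] = self._call_model(model, prompt, **params)
        # Every later call replays the stored response verbatim.
        return self._store[key]


# Stand-in for a real client (assumption): returns a different string per call.
_call_counter = itertools.count()


def fake_model(model, prompt, **params):
    return f"response #{next(_call_counter)} to {prompt!r}"


cache = ReplayCache(fake_model)
first = cache.complete("example-model", "Summarize the report.", temperature=0)
second = cache.complete("example-model", "Summarize the report.", temperature=0)
```

Here `first` and `second` are identical because the second call never reaches the model. A real deployment would likely persist the store and include the model version in the key, since a model update changes outputs as well.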
Journey Context:
A widespread belief is that temperature=0 (greedy decoding) makes LLM outputs deterministic: same input, same output, every time. In practice, outputs can vary across runs even at temperature=0. There are three main causes: (1) GPU floating-point arithmetic is not associative, so the order of parallel reductions, which depends on hardware, driver, kernel choice, and batch configuration, can change results in the last bits and occasionally flip the top-ranked token when two candidates are nearly tied; (2) distributed inference may route identical requests to different replicas that differ in hardware, software version, or the other requests batched alongside, so the same prompt is not always computed the same way; (3) some implementations do not treat temperature=0 as a strict argmax and still run their sampling path, including top-p or top-k filtering, which makes the behavior implementation-dependent. Once one token differs, every later token is conditioned on a different prefix, so a tiny numerical difference can grow into a visibly different completion. This is a systems-level constraint, not a model behavior that can be prompted away. Developers who build pipelines that assume temperature=0 determinism (e.g., for caching, testing, or reproducibility) encounter subtle failures that are hard to debug because the non-determinism is intermittent and environment-dependent.
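The reduction-order effect can be reproduced on a CPU with plain Python floats. The sketch below is illustrative only: the magnitudes are deliberately exaggerated so the difference is visible, the token names and logit values are made up, and `reduce_in_order` and `greedy_pick` are hypothetical helpers, not part of any inference library. Real discrepancies are far smaller and only matter when two candidate tokens are nearly tied.

```python
def reduce_in_order(values, order):
    """Sum values following the given index order, as one reduction schedule might."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


def greedy_pick(logits):
    """One greedy decoding step: return the token with the highest logit."""
    return max(logits, key=logits.get)


# Partial sums contributing to token A's logit (exaggerated, made-up values).
contributions = [1e16, 1.0, -1e16]

# Schedule 1: (1e16 + 1.0) + (-1e16). The 1.0 is lost to rounding.
logit_a_schedule_1 = reduce_in_order(contributions, [0, 1, 2])

# Schedule 2: (1e16 + (-1e16)) + 1.0. The 1.0 survives.
logit_a_schedule_2 = reduce_in_order(contributions, [0, 2, 1])

# Token B's logit is the same in both runs.
pick_1 = greedy_pick({"A": logit_a_schedule_1, "B": 0.5})
pick_2 = greedy_pick({"A": logit_a_schedule_2, "B": 0.5})

print(logit_a_schedule_1, logit_a_schedule_2)  # 0.0 1.0
print(pick_1, pick_2)  # B A
```

The same inputs and the same greedy rule give different tokens purely because the additions happened in a different order, which is the situation a GPU kernel can produce when its reduction schedule changes with batch size or hardware.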
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:40:04.635621+00:00— report_created — created