Report #29369
[counterintuitive] Model produces different outputs at temperature 0 — expected fully deterministic behavior
Do not rely on exact reproducibility, even at temperature 0. If a step must behave deterministically, move it into code execution or check its result with hash-based verification rather than expecting the model to repeat itself. For testing and evaluation, compare outputs for semantic equivalence rather than string equality.
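As a minimal sketch of comparing by meaning instead of by string, assume the model is asked to return JSON. The helper names below (canonical_json, same_answer, fingerprint) are hypothetical and not part of any library. The idea is to parse each output and compare a canonical form, so that differences in whitespace or key order between runs do not count as failures. Hashing the canonical form is one possible reading of "hash-based verification": it yields a stable value that can be stored and checked later.

```python
import hashlib
import json


def canonical_json(text: str) -> str:
    """Parse JSON text and re-serialize it with sorted keys and fixed separators."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))


def same_answer(output_a: str, output_b: str) -> bool:
    """Treat two outputs as equal when their parsed JSON content matches."""
    return canonical_json(output_a) == canonical_json(output_b)


def fingerprint(text: str) -> str:
    """SHA-256 of the canonical form; stable across formatting differences."""
    return hashlib.sha256(canonical_json(text).encode("utf-8")).hexdigest()


# Two hypothetical runs that differ only in formatting and key order.
run_1 = '{"label": "positive", "score": 3}'
run_2 = '{ "score": 3,\n  "label": "positive" }'

assert run_1 != run_2                            # string equality fails
assert same_answer(run_1, run_2)                 # content comparison passes
assert fingerprint(run_1) == fingerprint(run_2)  # hashes agree as well
```

This only covers structured output. For free-form text, canonicalization is a crude proxy, and a task-specific check (for example, extracting the final answer and comparing that) is likely a better fit.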
Journey Context:
Temperature 0 selects the highest-probability token at each step (greedy decoding), which sounds deterministic. But GPU floating-point arithmetic is non-associative: parallel reduction operations can produce slightly different logits, and therefore slightly different probabilities, depending on execution order, hardware (A100 vs H100), CUDA version, batch size, and deployment configuration. When two candidate tokens are nearly tied, these tiny differences can flip which one has the highest probability, and because each token conditions everything generated after it, a single flip makes the rest of the output diverge. OpenAI introduced the seed parameter to improve reproducibility but explicitly documents it as best-effort, not guaranteed. This is not a model error; it is a fundamental property of parallel and distributed floating-point computation. If your pipeline requires bit-exact reproducibility across runs, you need a different architecture (deterministic compute modes or external verification), not a different prompt.
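The effect can be reproduced on a CPU with a small sketch. The values and the helper names (sequential_sum, greedy_pick) are made up for illustration, and the magnitudes are deliberately exaggerated; on real hardware the discrepancy is usually in the last few bits and only matters when two tokens are almost tied. The same three contributions are summed in two different orders, standing in for two different reduction schedules, and the greedy choice changes as a result.

```python
def sequential_sum(values):
    """Add values strictly left to right, like one fixed reduction order."""
    total = 0.0
    for v in values:
        total += v
    return total


def greedy_pick(logits):
    """Temperature-0 decoding: return the token with the largest logit."""
    return max(logits, key=logits.get)


# The same three contributions to the "yes" logit, reduced in two orders.
order_a = [1e16, 1.0, -1e16]   # the 1.0 is absorbed when added to 1e16
order_b = [1e16, -1e16, 1.0]   # the large terms cancel first, so 1.0 survives

yes_a = sequential_sum(order_a)
yes_b = sequential_sum(order_b)
assert yes_a != yes_b          # same inputs, different floating-point result

# A competing token with a fixed logit between the two results.
pick_a = greedy_pick({"yes": yes_a, "no": 0.5})
pick_b = greedy_pick({"yes": yes_b, "no": 0.5})
print(yes_a, pick_a)
print(yes_b, pick_b)
assert pick_a != pick_b        # the greedy choice flips with reduction order
```

An explicit loop is used instead of the built-in sum() because recent Python versions apply compensated summation to floats, which would hide the effect being demonstrated.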
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:41:16.334929+00:00— report_created — created