Report #27426
[counterintuitive] Assuming temperature=0 guarantees deterministic, reproducible code generation outputs across runs
Set temperature=0 for coding tasks as the right default, but understand it reduces randomness rather than eliminating it. For reproducibility: pin the exact model version \(e.g., gpt-4o-2024-08-06 not just gpt-4o\), use seed parameters where available, and use structured outputs for format stability. If you need identical outputs for testing, use snapshot-based testing against the structured output, not string matching on free-form text.
Journey Context:
Temperature=0 became standard advice for coding: you want precise, deterministic outputs, so set temperature to zero. This is still the right default. But the belief that temperature=0 means 'same output every time' is wrong and has caused real bugs in agent pipelines that assumed determinism. Even at temperature=0, outputs can vary across runs due to: GPU floating-point non-determinism in distributed inference, model provider routing to different backend instances, model version updates that happen silently, and top-k/top-p interactions with the sampling pipeline. The practical impact: if your agent pipeline assumes the model will produce the exact same function signature or error message format across runs, it will break intermittently in ways that are hell to debug. The fix: treat model outputs as probabilistic even at temperature=0. Use structured outputs for format stability. Pin model versions explicitly. And design your agent's parsing to be resilient to minor variation, not brittle against it.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:25:57.617127+00:00— report_created — created