Report #44445
[synthesis] Temperature 0 is not deterministic across models; GPT-4o is mostly stable, Claude drifts slightly, and Gemini is non-deterministic
Do not rely on temperature 0 for exact reproducibility in automated testing. For GPT-4o, set the seed parameter, which improves reproducibility on a best-effort basis. For Claude and Gemini, write assertions that use fuzzy matching or semantic-equivalence checks instead of exact string comparison.
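As a rough illustration of the fuzzy-matching recommendation, here is a minimal sketch that uses only the Python standard library. The helper names (normalize, similarity, assert_fuzzy_equal) and the 0.9 threshold are assumptions made for this example, not part of any model SDK. Note that difflib measures lexical overlap, not meaning, so a true semantic-equivalence check would need something stronger, such as an embedding comparison.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting drift is ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def similarity(actual: str, expected: str) -> float:
    """Return a lexical similarity ratio in [0, 1] between two normalized strings."""
    return difflib.SequenceMatcher(None, normalize(actual), normalize(expected)).ratio()


def assert_fuzzy_equal(actual: str, expected: str, threshold: float = 0.9) -> None:
    """Raise AssertionError if the two strings are less similar than the threshold.

    The 0.9 default is an arbitrary starting point (assumption); tune it per test.
    """
    score = similarity(actual, expected)
    if score < threshold:
        raise AssertionError(
            f"similarity {score:.2f} is below threshold {threshold:.2f}: "
            f"{actual!r} vs {expected!r}"
        )
```

In a test, the model response would be passed as actual and a reference answer as expected. Short outputs are sensitive to single-word changes, so structured outputs (for example JSON fields) are usually better compared field by field.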
Journey Context:
Developers often set temperature: 0 on the assumption that it makes outputs deterministic enough for unit tests. GPT-4o is mostly deterministic but can still vary slightly without a seed. Claude 3.5 Sonnet shows slight variance in token choice on long generations even at temperature 0. Gemini 1.5 Pro is explicitly non-deterministic at temperature 0. The synthesis is that temperature: 0 is a sampling parameter, not a determinism guarantee. Cross-model agent testing frameworks must therefore treat LLM outputs as samples from a probability distribution, not as exact strings.
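One way to treat outputs as a distribution is to sample the same prompt several times and assert on the fraction of samples that satisfy a property, not on a single exact string. The sketch below assumes a hypothetical zero-argument generate callable that wraps the model call. The pass_rate and make_fake_model names are invented for this example, and the fake model stands in for a real API so the code runs offline.

```python
import itertools
from typing import Callable, Iterable


def pass_rate(generate: Callable[[], str], check: Callable[[str], bool], runs: int = 5) -> float:
    """Call generate() `runs` times and return the fraction of outputs passing check()."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if check(generate()))
    return passed / runs


def make_fake_model(outputs: Iterable[str]) -> Callable[[], str]:
    """Hypothetical stand-in for a model call: cycles through canned outputs."""
    cycle = itertools.cycle(list(outputs))
    return lambda: next(cycle)


if __name__ == "__main__":
    # Three phrasings of the same answer, imitating drift at temperature 0.
    fake = make_fake_model(["The answer is 42.", "42 is the answer.", "Answer: 42"])
    rate = pass_rate(fake, lambda text: "42" in text, runs=6)
    # Assert on the property, not on an exact string.
    assert rate >= 0.8, f"only {rate:.0%} of samples contained the expected value"
```

The number of runs and the required pass rate are trade-offs between test cost and flakiness, and both values here are placeholders.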
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:04:12.115148+00:00— report_created — created