Report #52518
[synthesis] Identical API calls at temperature=0 produce different outputs across runs, breaking reproducible agent tests
For GPT-4o, temperature=0 does NOT guarantee determinism: pass the seed parameter with a fixed integer and log the system_fingerprint from each response to track backend model version changes. For Claude, temperature=0 is close to deterministic, but minor variations occur across API regions. For Gemini, temperature=0 is the most deterministic of the three. Design test suites around fuzzy matching (semantic equivalence, not string equality) and never rely on exact output reproduction.
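The fuzzy-matching advice can be sketched as a small test helper. This is a minimal sketch, not a definitive implementation: it uses token-level similarity from the standard library's `difflib` as a cheap lexical stand-in for semantic equivalence, and the helper names (`normalize`, `similarity`, `outputs_equivalent`) and the 0.8 threshold are assumptions chosen for illustration. A real suite would likely swap in an embedding-based or model-graded comparison.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> list[str]:
    """Lowercase the text and keep only alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def similarity(a: str, b: str) -> float:
    """Token-level similarity ratio in [0, 1] computed by difflib."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def outputs_equivalent(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two outputs as equivalent if they are lexically close enough.

    The threshold is an assumption; tune it against known-good pairs.
    """
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    run_1 = "The answer is 42."
    run_2 = "the answer  is 42"
    # Differences in case, spacing, and punctuation do not fail the check.
    assert outputs_equivalent(run_1, run_2)
    # A materially different response does.
    assert not outputs_equivalent(run_1, "I cannot help with that request")
```

In a test, `assert outputs_equivalent(actual, expected)` replaces `assert actual == expected`, so run-to-run drift in wording or formatting no longer fails the suite.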
Journey Context:
A widespread assumption is that temperature=0 means deterministic output. This is false, and the degree of non-determinism varies by provider. GPT-4o at temperature=0 can produce meaningfully different outputs across runs for three reasons: GPU floating-point non-determinism in distributed inference, model weight updates that don't change the version string, and routing to different backend instances. OpenAI introduced the seed parameter specifically to address this, but even with a fixed seed, a change in system_fingerprint indicates a backend change that can alter outputs. Claude is more stable at temperature=0 but not perfectly so. In practice, temperature=0 means the lowest practical randomness, not determinism. Agent test suites, evaluation benchmarks, and reproducibility claims must all account for run-to-run variance at temperature=0, and so must debugging.
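One way to act on the system_fingerprint signal is to record it with every run and compare outputs only within the same fingerprint. The sketch below assumes each run has been logged as a plain dict with `system_fingerprint` and `output` keys. That record shape and the helper names are hypothetical, not part of any provider SDK, and the fingerprint values are made-up placeholders.

```python
from collections import defaultdict


def group_by_fingerprint(runs: list[dict]) -> dict[str, list[str]]:
    """Bucket logged outputs by the backend fingerprint that produced them.

    Runs with a missing or empty fingerprint go under "unknown".
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for run in runs:
        key = run.get("system_fingerprint") or "unknown"
        groups[key].append(run["output"])
    return dict(groups)


def backend_changed(runs: list[dict]) -> bool:
    """True if the logged runs span more than one fingerprint."""
    return len({run.get("system_fingerprint") for run in runs}) > 1


if __name__ == "__main__":
    # Placeholder records; in practice these come from logged API responses.
    logged_runs = [
        {"system_fingerprint": "fp_a", "output": "Paris"},
        {"system_fingerprint": "fp_a", "output": "Paris"},
        {"system_fingerprint": "fp_b", "output": "Paris, France"},
    ]
    if backend_changed(logged_runs):
        print("Fingerprint changed; output drift may be a backend change.")
    for fingerprint, outputs in group_by_fingerprint(logged_runs).items():
        print(fingerprint, len(set(outputs)), "distinct output(s)")
```

Grouping this way separates two causes during debugging: variance within one fingerprint points at sampling or hardware non-determinism, while variance that lines up with a fingerprint change points at a backend update.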
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:38:38.207781+00:00 - report_created - created