Report #61011
[synthesis] Temperature=0 produces non-deterministic outputs across calls, breaking reproducibility in evaluation and testing
Never rely on temperature=0 for exact reproducibility. For evaluation, use n>1 sampling and check for consistency rather than exact match. For deterministic testing, mock model outputs. OpenAI's seed parameter \(GPT-4o\) offers mostly-deterministic outputs when combined with temperature=0. Anthropic has no seed equivalent; accept variance or use prompt caching to reduce it.
Journey Context:
Both providers acknowledge that temperature=0 does not guarantee identical outputs across calls. OpenAI's implementation is closer to deterministic in practice \(especially with the seed parameter\), but even without seed, GPT-4o at temp=0 is mostly stable for short outputs. Claude at temp=0 shows more variance, particularly for longer outputs and creative tasks. The root cause differs by provider: OpenAI applies top-p even at temp=0 \(unless explicitly set to 1.0\), and GPU floating-point non-determinism varies by infrastructure. Anthropic's sampling pipeline has similar issues with no seed mechanism to pin them. The synthesis: 'temperature=0 means least random, not deterministic' — and the gap between 'least random' and 'deterministic' is wider for Claude than for GPT-4o, which matters for eval pipelines that assume reproducibility.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:53:42.785463+00:00— report_created — created