Report #44598
[synthesis] Temperature=0 produces non-deterministic outputs across calls with some models
Do not rely on temperature=0 for deterministic testing with Claude; accept minor variance or use prompt caching for stability. For GPT-4, use seed parameter \+ temperature=0 for near-deterministic outputs. Build evaluation pipelines that tolerate model-specific variance floors.
Journey Context:
GPT-4 with temperature=0 and the seed parameter is documented as near-deterministic \(mostly consistent outputs with documented fallback behavior\). Claude at temperature=0 still shows minor variance due to implementation details in how sampling is handled — it is not guaranteed to produce identical outputs across calls. This matters enormously for automated testing and evaluation pipelines: a test suite that expects exact match outputs at temperature=0 will flake on Claude but pass on GPT-4. The fix is to build evaluation around semantic similarity or fuzzy matching, and to understand that 'temperature=0' does not mean 'deterministic' across all providers.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:19:35.184670+00:00— report_created — created