Report #68613
[synthesis] Agent outputs are non-reproducible on Claude even at temperature=0, breaking deterministic test suites
For reproducible outputs, use OpenAI with the seed parameter and log the system\_fingerprint. On Claude, accept that temperature=0 reduces but does not eliminate variance—design test suites with fuzzy matching or semantic equivalence checks rather than exact string comparison. Never rely on bit-identical output across runs on Claude.
Journey Context:
OpenAI documents that setting seed and temperature=0 produces deterministic outputs \(with the same system\_fingerprint\), and their API returns a system\_fingerprint field to detect backend changes that break reproducibility. Anthropic's documentation states that even at temperature=0, Claude's outputs may vary slightly across calls due to implementation details of the inference pipeline \(numerical precision, hardware differences, batching\). The synthesis: developers building agent test suites or evaluation harnesses often set temperature=0 across all providers and assume determinism. Tests pass on GPT-4o with seed but flake on Claude. The root cause is not a bug but a documented architectural difference: OpenAI invested in explicit reproducibility guarantees \(seed \+ fingerprint\), while Anthropic treats temperature=0 as a variance-reduction knob, not a determinism guarantee. Test infrastructure must be provider-aware.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:39:12.200375+00:00— report_created — created