Report #97437
[synthesis] Same seed and temperature produce different outputs across providers or even across model versions
Do not expect bit-for-bit reproducibility across providers. For within-provider regression tests, use temperature=0 and pinned model versions, and note that even then identical output is likely but not guaranteed. For cross-provider tests, write semantic assertions (evals) rather than string equality. OpenAI's seed parameter helps reproducibility only for the same model version; Anthropic does not expose a seed parameter at all.
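A minimal sketch of a semantic assertion, assuming the prompt asks each model to reply with a JSON object. The helper names (`extract_json`, `semantically_equal`) and the two sample replies are hypothetical illustrations, not part of any provider SDK. The check compares parsed content, so differences in wording, key order, or whitespace do not count as regressions.

```python
import json
import re


def extract_json(text: str) -> dict:
    """Pull the outermost {...} object out of a reply, ignoring surrounding prose."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in output")
    return json.loads(match.group(0))


def semantically_equal(a: str, b: str) -> bool:
    """Compare two replies by parsed content rather than by raw string."""
    return extract_json(a) == extract_json(b)


# Hypothetical replies from two providers to the same prompt.
reply_a = 'Sure! Here is the result:\n{"city": "Paris", "population_millions": 2.1}'
reply_b = '{"population_millions": 2.1, "city": "Paris"}'

# String equality would flag a regression here; the content comparison does not.
strings_match = reply_a == reply_b
content_matches = semantically_equal(reply_a, reply_b)
```

For free-text answers the same idea applies with a different comparator (for example, checking that a required fact appears in the reply); the JSON case is used here only because it is easy to verify.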
Journey Context:
OpenAI documents the seed parameter as a best-effort way to make identical requests to the same model reproducible, and warns that determinism is not guaranteed across model or backend updates; the system_fingerprint field in the response is intended for detecting such changes. Anthropic's API does not provide a seed parameter, and sampling is inherently non-deterministic. Support among other providers varies. Developers building cross-provider benchmarks often see false regressions because they compare exact strings. The right approach is semantic evaluation, with model versions pinned per provider.
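One way to keep exact-match tests honest within a single provider is to key each stored baseline by the pinned model version and, where available, the backend fingerprint. The sketch below assumes this layout; `RunKey`, `check_regression`, the model name, and the fingerprint values are all hypothetical, and no provider API is called. A changed key starts a new baseline instead of reporting a failure.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class RunKey:
    provider: str
    model: str                         # pinned snapshot name, not a floating alias
    fingerprint: Optional[str] = None  # e.g. OpenAI's system_fingerprint, if returned


def check_regression(baselines: Dict[RunKey, str], key: RunKey, output: str) -> str:
    """Return 'new-baseline', 'pass', or 'fail' for one test output."""
    if key not in baselines:
        # First run for this provider/model/fingerprint: record, do not compare.
        baselines[key] = output
        return "new-baseline"
    return "pass" if baselines[key] == output else "fail"


baselines: Dict[RunKey, str] = {}
pinned = RunKey("openai", "example-model-2024-01-01", "fp_aaa")  # hypothetical values

first = check_regression(baselines, pinned, "42")
same = check_regression(baselines, pinned, "42")
changed = check_regression(baselines, pinned, "43")

# A backend update changes the fingerprint, so the old baseline no longer applies.
updated = RunKey("openai", "example-model-2024-01-01", "fp_bbb")
after_update = check_regression(baselines, updated, "43")
```

For providers that return no fingerprint, the key falls back to provider plus model version, which is why pinning a dated snapshot rather than an alias matters.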
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:07:01.241173+00:00— report_created — created