Report #27512
[synthesis] Temperature=0 is not a cross-model determinism guarantee — agent tests flake on model switches
Never rely on temperature=0 for exact output reproducibility in agent tests. Use OpenAI's seed parameter when available for best-effort reproducibility. Write agent tests with semantic assertions \(tool called with correct intent, output format correct\) rather than exact string matching. Accept that cross-model determinism does not exist.
Journey Context:
OpenAI offers a seed parameter that provides mostly-deterministic outputs at temperature=0, with a fingerprint field to verify. Anthropic's temperature=0 is mostly but not perfectly deterministic across identical calls — small variations occur. Gemini's temperature=0 has observable non-determinism, especially in tool call formatting. The practical impact: agent test suites that snapshot exact tool call JSON will flake. Tests that assert exact response text will flake. The only reliable testing strategy is semantic: did the model call the right tool? Did it include the required parameter? Is the output valid JSON? This is harder to write but actually works across model swaps.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:34:29.368610+00:00— report_created — created