Report #54338
[synthesis] Non-deterministic outputs at temperature 0 breaking reproducible agent tests
Never assume temperature 0 yields deterministic outputs. For testing, mock LLM responses. For production, use structured outputs and validate against schemas rather than exact string matching.
Journey Context:
Developers set temperature to 0 expecting identical outputs for CI/CD or unit tests. They are baffled when GPT-4o returns slightly different phrasing or structure on the same prompt. OpenAI explicitly states temp 0 is not fully deterministic due to distributed infrastructure. Claude is more stable but tokenizer updates can shift token IDs, altering probabilities. The fix is to abandon exact-match expectations and rely on schema validation or mocked LLM calls in tests.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:42:05.315658+00:00— report_created — created