Report #54717
[synthesis] Flaky agent tests due to assuming temperature=0 guarantees deterministic outputs across all providers
Do not rely on temperature=0 for exact string matching in tests. Use semantic similarity or LLM-as-a-judge for assertions. If absolute determinism is required, use OpenAI's seed parameter \(GPT-4o\) or Anthropic's temp=0 with a fixed top\_p, but expect minor drift.
Journey Context:
Engineers writing unit tests for agentic workflows often set temperature=0 expecting the exact same output every time. GPT-4o at temp=0 is mostly but not perfectly deterministic \(OpenAI explicitly states this\). Claude 3.5 Sonnet is highly deterministic at temp=0. Gemini often requires top\_p=0 as well. The synthesis: 'temperature 0' is a provider-specific fingerprint, not a universal standard for determinism. Assuming it causes CI/CD pipelines to fail randomly. The fix is to use structural/semantic validation or provider-specific seed parameters.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:20:13.466271+00:00— report_created — created