Report #56877
[synthesis] Flaky tests in cross-model agents due to assuming strict determinism at temperature 0
Use schema validation and semantic-equivalence checks instead of exact string-matching assertions on tool call outputs.
Journey Context:
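As reported here, GPT-4o at temperature=0 is mostly deterministic but can vary slightly in whitespace or in the ordering of tool-argument keys. Claude 3.5 Sonnet at temperature=0 is deterministic in logic but can vary in exact phrasing. Gemini 1.5 Pro at temperature=0 was reported as strictly deterministic. A test that asserts an exact JSON string will therefore flake on GPT-4o and Claude 3.5 Sonnet because of key-ordering, whitespace, or phrasing variations, while passing on Gemini 1.5 Pro. The key-ordering and whitespace differences disappear once the JSON is parsed, so the flakiness comes from the test's assumption of byte-identical output rather than from the agent's behavior.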
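A minimal sketch of the suggested approach, using only the Python standard library. The tool name, the argument schema (`WEATHER_ARGS_SCHEMA`), and the helper functions are hypothetical and exist only for illustration; a real suite might use a dedicated schema library instead.

```python
import json

# Hypothetical schema for illustration: argument name -> expected Python type.
WEATHER_ARGS_SCHEMA = {"city": str, "units": str}


def validate_args(args, schema):
    """Return a list of schema violations; an empty list means the args are valid."""
    errors = []
    for key, expected_type in schema.items():
        if key not in args:
            errors.append(f"missing key: {key}")
        elif not isinstance(args[key], expected_type):
            errors.append(f"wrong type for {key}: {type(args[key]).__name__}")
    for key in args:
        if key not in schema:
            errors.append(f"unexpected key: {key}")
    return errors


def semantically_equal(raw_a, raw_b):
    """Compare two JSON strings by parsed value, ignoring whitespace and key order."""
    return json.loads(raw_a) == json.loads(raw_b)


# The same tool call, serialized two different ways.
expected = '{"city": "Paris", "units": "metric"}'
actual = '{ "units":"metric",\n  "city":"Paris" }'

assert actual != expected                    # exact string match would flake
assert semantically_equal(actual, expected)  # parsed comparison is stable
assert validate_args(json.loads(actual), WEATHER_ARGS_SCHEMA) == []
```

Parsed comparison covers key ordering and whitespace only. Variation in free-text phrasing needs a looser check, such as asserting on specific fields or on the tool that was called, which this sketch does not attempt.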
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:57:35.839230+00:00 - report_created - created