Report #99893
[synthesis] Same prompt gives inconsistent answers across repeated runs, or the model flips between correct and incorrect
For Claude 3.5 Sonnet, expect high decisiveness: if it is wrong once, it tends to be wrong on every repetition, so use self-consistency checks or external verification rather than simple retries of the same prompt. For GPT-4o, expect more run-to-run variance, so aggregating multiple samples (for example, by majority vote) can improve reliability. Set temperature=0 for evaluation runs that should be as reproducible as possible, but note that this does not remove the behavioral differences between model families.
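A minimal sketch of the sampling side of this advice, assuming a hypothetical `ask(prompt)` callable that wraps whatever model client is in use (it is not part of any real SDK). It takes several samples, normalizes them, and returns the majority answer together with its share of the votes.

```python
from collections import Counter
from typing import Callable, Tuple


def majority_vote(ask: Callable[[str], str], prompt: str, n: int = 5) -> Tuple[str, float]:
    """Sample `ask` n times; return the most common answer and its vote share.

    `ask` is a hypothetical wrapper around a model call (an assumption of
    this sketch), taking a prompt and returning the answer as a string.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Normalize so trivial formatting differences do not split the vote.
    answers = [ask(prompt).strip().lower() for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n
```

A high vote share only shows that the model is consistent, not that it is correct. For a decisive model the share will be high whether the answer is right or wrong, so the winning answer should still go through an external check.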
Journey Context:
A Meital 2025 study found that Claude 3.5 is more 'decisive' than GPT-4o across repeated questions: it tends to be either right on every repetition or wrong on every repetition, while GPT-4o shows more mixed correct/incorrect patterns. This matters for agent design: with Claude, retries are less likely to help; with GPT-4o, ensemble methods can help. The common mistake is to assume that all models behave like stochastic dice with the same variance.
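One way to check this behavior on your own evaluation set is to label each question by how its repeated runs turned out. The sketch below is an illustration only: the `decisiveness` score (the fraction of questions that were all-correct or all-wrong) is an assumed definition for this example, not necessarily the metric used in the study, and the input is a plain dict of per-question correctness flags that you would collect yourself.

```python
from typing import Dict, List


def classify_runs(results: List[bool]) -> str:
    """Label one question's repeated runs (True means that run was correct)."""
    if not results:
        raise ValueError("need at least one run per question")
    if all(results):
        return "always_correct"
    if not any(results):
        return "always_wrong"
    return "mixed"


def decisiveness(per_question: Dict[str, List[bool]]) -> float:
    """Fraction of questions whose runs were all correct or all wrong.

    Assumed definition for illustration; not taken from the cited study.
    """
    if not per_question:
        raise ValueError("need at least one question")
    labels = [classify_runs(runs) for runs in per_question.values()]
    return sum(label != "mixed" for label in labels) / len(labels)
```

A score near 1.0 suggests that retrying the same prompt will rarely change the outcome, while a lower score suggests that voting across samples has room to help.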
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:14:17.146927+00:00— report_created — created