Report #70886
[cost_intel] High school competition math (AIME) accuracy with GPT-4o vs o1/o3
Use o3/o1 for AIME-level math; GPT-4o scores ~13% while o3 reaches 83%+ on AIME 2024. The 6x-10x cost increase is justified by the >60 percentage point accuracy gain.
Journey Context:
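Many assume GPT-4o is 'good enough' for math, but competition math requires long chains of deduction that instruction-tuned models without dedicated reasoning tend to fail at. Prompt-chaining GPT-4o with 'think step by step' still yields <20% on AIME vs o3's 80%+. The cost per correct answer can be comparable or lower with reasoning models: at ~13% accuracy GPT-4o needs roughly 1/0.13 ≈ 7.7 calls per correct answer, against about 1/0.83 ≈ 1.2 for o3, so one o3 call stands in for ~6 GPT-4o calls. Break-even therefore sits near a 6.4x per-call price ratio, at the low end of the 6x-10x range, and the comparison assumes you can tell which GPT-4o answer is the correct one.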
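The cost-per-correct-answer arithmetic can be sketched as below. This is a minimal illustration, not a benchmark: the accuracies are the approximate figures quoted in this report, the per-call costs are hypothetical relative units (GPT-4o = 1.0), and the function names are made up for this example. It also assumes calls are independent and that a correct answer can be recognized when it appears.

```python
# Approximate AIME 2024 accuracies quoted in this report (not re-measured here).
GPT4O_ACC = 0.13
O3_ACC = 0.83


def expected_calls_per_correct(accuracy: float) -> float:
    """Expected number of independent calls until one is correct (1 / p)."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return 1.0 / accuracy


def cost_per_correct(cost_per_call: float, accuracy: float) -> float:
    """Expected spend per correct answer, in the same units as cost_per_call."""
    return cost_per_call * expected_calls_per_correct(accuracy)


def break_even_price_ratio(acc_cheap: float, acc_strong: float) -> float:
    """Per-call price ratio (strong / cheap) at which cost per correct answer is equal."""
    return acc_strong / acc_cheap


if __name__ == "__main__":
    baseline = cost_per_correct(1.0, GPT4O_ACC)  # GPT-4o at a relative cost of 1.0
    print(f"GPT-4o cost per correct answer: {baseline:.2f}")
    print(f"break-even price ratio: {break_even_price_ratio(GPT4O_ACC, O3_ACC):.2f}")

    # Hypothetical o3 price multiples spanning the 6x-10x range from this report.
    for ratio in (6.0, 8.0, 10.0):
        strong = cost_per_correct(ratio, O3_ACC)
        verdict = "cheaper" if strong < baseline else "more expensive"
        print(f"o3 at {ratio:.0f}x: {strong:.2f} per correct answer ({verdict})")
```

Substituting real per-call costs (including reasoning tokens, which are billed on top of visible output) for the relative units shows where a given workload falls relative to break-even.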
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:33:30.740603+00:00— report_created — created