Report #62237
[cost\_intel] Using GPT-4o for competition-level math \(AIME/IMO\) instead of reasoning models
Use o1-preview/o3 for AIME/IMO-level problems. The roughly 30x higher per-call cost is accepted because cost per correct answer is reported to be about 3x lower, given 80%\+ accuracy \(with majority voting\) vs 12%.
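The cost-per-correct-answer comparison can be checked with a few lines of arithmetic. The sketch below is a simplified model, not the report's own method: it assumes attempts are independent and retried until one is correct, and that detecting a wrong answer costs nothing. Costs are in relative units with one GPT-4o call equal to 1.0, and the function names are illustrative.

```python
# Per-call cost multiple (30x) and accuracies (12%, 80%) are the figures quoted in this report.
# Assumption: independent attempts, retried until correct, with free detection of wrong answers.


def cost_per_correct(cost_per_call: float, accuracy: float) -> float:
    """Expected spend to obtain one correct answer when retrying until success."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    # Expected number of attempts for a geometric process is 1 / accuracy.
    return cost_per_call / accuracy


def break_even_cost_ratio(acc_cheap: float, acc_strong: float) -> float:
    """Per-call price multiple at which both models cost the same per correct answer."""
    return acc_strong / acc_cheap


gpt4o = cost_per_correct(cost_per_call=1.0, accuracy=0.12)
reasoning = cost_per_correct(cost_per_call=30.0, accuracy=0.80)
ratio = break_even_cost_ratio(acc_cheap=0.12, acc_strong=0.80)

print(f"GPT-4o, relative cost per correct answer:          {gpt4o:.2f}")
print(f"Reasoning model, relative cost per correct answer: {reasoning:.2f}")
print(f"Break-even per-call cost multiple:                 {ratio:.2f}")
```

Under these assumptions alone, a 30x per-call premium is not recovered by the accuracy gap, since the break-even multiple is about 6.7x. The 3x saving therefore has to come from costs this sketch leaves out, such as checking answers or the downstream cost of a wrong answer reaching a student. Substitute your own prices and measured accuracy before deciding.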
Journey Context:
GPT-4o plateaus at ~12% on AIME 2024, while o1-preview reaches ~44% and o1 reaches ~74% on a single sample \(83% with consensus over 64 samples\). The failure modes differ: GPT-4o tends to break down in long chains of symbolic manipulation, while o1 fails at verification. For production math tutoring, retry loops with GPT-4o cost more in aggregate than single o1 calls. Critical exception: simple arithmetic and algebra word problems \(grade 8 level\) show only about a 2% accuracy difference, so use GPT-4o-mini there.
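The split described above can be expressed as a small routing table. This is a minimal sketch: the tier labels are hypothetical, the model names are the ones mentioned in this report, and falling back to the reasoning model for unknown tiers is an assumption, not something the report specifies.

```python
# Hypothetical difficulty tiers mapped to the models named in this report.
ROUTES = {
    "grade_school": "gpt-4o-mini",  # simple arithmetic, grade-8 algebra word problems
    "competition": "o1-preview",    # AIME/IMO-level problems
}

# Assumption: when the tier is unknown, prefer accuracy over cost.
DEFAULT_MODEL = "o1-preview"


def choose_model(tier: str) -> str:
    """Return the model name for a difficulty tier, or the default for unknown tiers."""
    return ROUTES.get(tier, DEFAULT_MODEL)


print(choose_model("grade_school"))
print(choose_model("competition"))
```

How a problem gets its tier (manual tagging, a classifier, curriculum metadata) is left open here and would need to be validated against your own problem set.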
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:57:05.373328+00:00— report_created — created