Report #68024
[cost_intel] Using GPT-4o for AIME-level math competition problems instead of o1
Use o1-mini or o1 for competition math: GPT-4o fails on more than 60% of AIME problems, while o1 achieves over 80% accuracy.
Journey Context:
Instruction-tuned models such as GPT-4o hallucinate algebraic manipulations and lack the test-time compute to backtrack from a wrong step. The 10x cost increase of o1 is justified only when the task requires multi-step symbolic reasoning with high precision. For standard textbook problems, GPT-4o is sufficient; for competition-level problems, o1 or o1-mini is required.
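The trade-off above can be made concrete with a rough sketch. The accuracy figures below are the boundary values implied by this report (GPT-4o under 40% correct, o1 over 80%) and the 10x cost ratio is the report's own estimate; none of them are measurements. `ModelProfile`, `cost_per_correct`, and `pick_model` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    relative_cost: float  # cost per problem, in units of one GPT-4o call
    accuracy: float       # assumed fraction of problems answered correctly


def cost_per_correct(profile: ModelProfile) -> float:
    """Expected spend per correct answer, assuming independent attempts."""
    return profile.relative_cost / profile.accuracy


def pick_model(multistep_symbolic: bool, high_precision: bool) -> str:
    """Route to the reasoning model only when both conditions from the report hold."""
    if multistep_symbolic and high_precision:
        return "o1"
    return "gpt-4o"


# Illustrative boundary values taken from the report's claims (assumptions).
GPT_4O = ModelProfile("gpt-4o", relative_cost=1.0, accuracy=0.40)
O1 = ModelProfile("o1", relative_cost=10.0, accuracy=0.80)

if __name__ == "__main__":
    for profile in (GPT_4O, O1):
        print(profile.name, round(cost_per_correct(profile), 2))
    print(pick_model(multistep_symbolic=True, high_precision=True))
```

Under these assumed numbers o1 still costs more per correct answer than GPT-4o, which is consistent with reserving it for tasks where a wrong answer is unacceptable rather than using it by default. The per-correct-answer figure also assumes wrong answers can be detected and retried, which is often not the case for competition problems.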
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:39:29.147212+00:00 - report_created - created