Report #66569
[cost_intel] When does o3-mini justify a 50x cost over GPT-4o for math tasks?
Deploy o3-mini for AIME/IMO-level competition math, where the accuracy delta exceeds 60 percentage points; use GPT-4o for algebra and arithmetic, where its accuracy is already above 95%.
Journey Context:
OpenAI evals show o3-mini scores ~87% on AIME 2024 vs GPT-4o's ~12%. The cost gap is 50-100x ($1.10 vs $0.015 per 1M tokens), but the accuracy cliff is absolute: cheap models fail on multi-step geometry proofs that require more than five reasoning steps. Common error: using o3 for simple arithmetic (cost $50 vs $0.10 for the same result). Quality signature: if the problem requires 'aha' insights or theorem application, cheap models hallucinate; if it requires only computation, they suffice.
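The trade-off above can be sketched as a marginal-cost check: how many extra dollars are paid per additional unit of accuracy when switching to the stronger model. This is a rough illustration, not a validated routing policy. The prices and AIME accuracies are the figures quoted in this report; the 99% arithmetic accuracy for o3-mini, the threshold of 5.0, and all names (`ModelProfile`, `marginal_cost_per_extra_correct`, `pick_model`) are hypothetical. It also assumes both models consume the same number of tokens per problem, which is unlikely to hold for a reasoning model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost_per_1m_tokens: float  # USD, as quoted in this report
    accuracy: float            # fraction of problems solved, 0..1


def marginal_cost_per_extra_correct(cheap: ModelProfile, strong: ModelProfile) -> float:
    """Extra USD per 1M tokens paid for each additional unit of accuracy."""
    gain = strong.accuracy - cheap.accuracy
    if gain <= 0:
        # The stronger model buys no extra accuracy, so no price is justified.
        return float("inf")
    return (strong.cost_per_1m_tokens - cheap.cost_per_1m_tokens) / gain


def pick_model(cheap: ModelProfile, strong: ModelProfile, max_marginal_cost: float) -> str:
    """Return the stronger model's name only if its marginal cost is under budget."""
    if marginal_cost_per_extra_correct(cheap, strong) <= max_marginal_cost:
        return strong.name
    return cheap.name


# Competition math: accuracies as reported for AIME 2024.
aime_cheap = ModelProfile("gpt-4o", 0.015, 0.12)
aime_strong = ModelProfile("o3-mini", 1.10, 0.87)

# Simple arithmetic: 0.95 is from the report; 0.99 is a hypothetical placeholder.
arith_cheap = ModelProfile("gpt-4o", 0.015, 0.95)
arith_strong = ModelProfile("o3-mini", 1.10, 0.99)

THRESHOLD = 5.0  # hypothetical budget, USD per 1M tokens per unit of accuracy

aime_marginal = marginal_cost_per_extra_correct(aime_cheap, aime_strong)     # about 1.45
arith_marginal = marginal_cost_per_extra_correct(arith_cheap, arith_strong)  # about 27.1

aime_choice = pick_model(aime_cheap, aime_strong, THRESHOLD)
arith_choice = pick_model(arith_cheap, arith_strong, THRESHOLD)

print(f"AIME: {aime_choice} (marginal cost {aime_marginal:.2f})")
print(f"Arithmetic: {arith_choice} (marginal cost {arith_marginal:.2f})")
```

Under these assumptions the same price gap costs roughly 19 times more per point of accuracy gained on arithmetic than on AIME-level problems, which is the pattern the recommendation relies on.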
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:12:54.138730+00:00— report_created — created