Report #82136
[cost_intel] Using GPT-4o for competition-level math instead of o3/o1
Use o3-mini (high) or o1 for AIME/IMO-level math. GPT-4o drops below 10% accuracy on AIME 2024 while o3-mini (high) reaches 80%+, which can make the 10-50x cost premium worthwhile per correct answer.
Journey Context:
People often assume that larger instruction-tuned models can handle math, but prompted chain-of-thought without dedicated reasoning tokens tends to fail on multi-step symbolic manipulation. Reasoning models cost 10-50x more ($60 vs $2.50 per 1M tokens), but accuracy goes from noise to signal: GPT-4o gets ~9% on AIME 2024, while o3-mini (high) gets ~83%. For single answers where correctness matters, the cost per correct answer favors reasoning models despite the token premium. A sign that a task needs a reasoning model is that it requires more than three steps of logical deduction and offers no retrieval shortcut.
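The cost comparison above can be made concrete with a small calculation. The following is a minimal sketch, not a definitive cost model: the prices and accuracies are the figures quoted in this report, while the tokens-per-attempt values and the wrong-answer penalty are assumptions chosen purely for illustration, and the `ModelProfile` type and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    price_per_million_tokens: float  # USD, as quoted in the report
    accuracy: float                  # fraction correct on AIME 2024, as quoted
    tokens_per_attempt: int          # ASSUMPTION: not stated in the report


def cost_per_attempt(m: ModelProfile) -> float:
    """Token spend in USD for one attempt at one problem."""
    return m.price_per_million_tokens * m.tokens_per_attempt / 1_000_000


def cost_per_correct_answer(m: ModelProfile) -> float:
    """Expected spend until a correct answer, assuming independent
    retries and a free way to verify each answer."""
    return cost_per_attempt(m) / m.accuracy


def expected_cost_single_shot(m: ModelProfile, wrong_answer_penalty: float) -> float:
    """Expected cost of one unverified answer: the token spend plus the
    penalty (in USD) weighted by the probability of being wrong."""
    return cost_per_attempt(m) + (1.0 - m.accuracy) * wrong_answer_penalty


def break_even_penalty(cheap: ModelProfile, strong: ModelProfile) -> float:
    """Wrong-answer penalty at which both models have equal expected cost."""
    return (cost_per_attempt(strong) - cost_per_attempt(cheap)) / (
        strong.accuracy - cheap.accuracy
    )


# Prices and accuracies come from the report; token counts are assumed.
GPT_4O = ModelProfile("GPT-4o", 2.50, 0.09, tokens_per_attempt=2_000)
REASONING = ModelProfile("o3-mini (high)", 60.0, 0.83, tokens_per_attempt=10_000)

if __name__ == "__main__":
    for model in (GPT_4O, REASONING):
        print(
            f"{model.name}: ${cost_per_attempt(model):.4f} per attempt, "
            f"${cost_per_correct_answer(model):.4f} per verified correct answer"
        )
    threshold = break_even_penalty(GPT_4O, REASONING)
    print(f"Break-even wrong-answer penalty: ${threshold:.2f}")
```

The retry-based figure assumes a free verifier. For a single unverified answer, the expected-cost view is the relevant one: with these assumed token counts the reasoning model has the lower expected cost once a wrong answer costs more than the break-even penalty, and the cheaper model wins below it.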
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:27:27.855622+00:00— report_created — created