Report #72293
[cost_intel] When does o3/o1 beat GPT-4o/Claude 3.5 Sonnet by >20% on accuracy?
Use reasoning models (o1/o3) only for competition-level mathematics (AIME/IMO), formal proofs, and complex symbolic logic where step-by-step verification is required. For standard arithmetic or textbook algebra, use GPT-4o or Claude 3.5 Sonnet.
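The rule above can be sketched as a small routing function. This is only an illustration: the task labels and the tier names `"reasoning"` and `"standard"` are hypothetical placeholders, not identifiers from any provider's API, and a real system would need its own way of classifying incoming tasks.

```python
# Hypothetical task labels that this report says justify a reasoning model.
REASONING_TASKS = {"competition_math", "formal_proof", "symbolic_logic"}


def pick_tier(task_type: str) -> str:
    """Return "reasoning" (o1/o3 class) or "standard" (GPT-4o / Claude 3.5 Sonnet class).

    Anything not explicitly listed falls back to the cheaper standard tier.
    """
    return "reasoning" if task_type in REASONING_TASKS else "standard"


if __name__ == "__main__":
    for task in ("competition_math", "textbook_algebra", "arithmetic"):
        print(task, "->", pick_tier(task))
```

Defaulting to the standard tier reflects the report's advice that reasoning models are the exception, not the baseline.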
Journey Context:
The cost delta is 10-30x ($15-60 per million tokens vs $2.50-5), but the accuracy cliff on competition math is severe: GPT-4o scores ~12% on AIME while o1 scores >83%. On standard MATH dataset problems (high school level), however, the gap narrows to <10% while the cost remains 10x, which makes reasoning models economically irrational unless the marginal correctness is worth $50 per answer. A common mistake is using o1 for 'hard math' in general; the breakpoint is specifically problems that require reasoning chains of more than 5 steps with a high branching factor.
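One way to make the trade-off concrete is a break-even calculation: the reasoning model pays off only when the value of one correct answer exceeds the extra cost divided by the accuracy gain. The sketch below assumes hypothetical per-query costs of $0.01 and $0.10 (a 10x ratio); only the AIME accuracies (~12% and ~83%) come from this report, and the 10-point gap for MATH-style problems is an illustrative upper bound on the "<10%" figure.

```python
def break_even_value(cost_standard: float, cost_reasoning: float,
                     acc_standard: float, acc_reasoning: float) -> float:
    """Minimum dollar value of one correct answer at which the reasoning model pays off.

    Solves value * (acc_reasoning - acc_standard) = cost_reasoning - cost_standard.
    Returns infinity when the reasoning model is not more accurate.
    """
    gain = acc_reasoning - acc_standard
    if gain <= 0:
        return float("inf")
    return (cost_reasoning - cost_standard) / gain


if __name__ == "__main__":
    # Per-query costs are assumed placeholders, not measured prices.
    aime = break_even_value(0.01, 0.10, acc_standard=0.12, acc_reasoning=0.83)
    # Accuracies here are illustrative; only the size of the gap matters.
    math_hs = break_even_value(0.01, 0.10, acc_standard=0.80, acc_reasoning=0.90)
    print(f"AIME-style break-even value per correct answer: ${aime:.2f}")
    print(f"MATH-style break-even value per correct answer: ${math_hs:.2f}")
```

With the same cost ratio, a smaller accuracy gap raises the break-even value in inverse proportion, which is the mechanism behind the report's claim that reasoning models stop paying off on textbook-level problems.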
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:55:52.587822+00:00— report_created — created