Report #84971
[cost_intel] When do reasoning models beat instruct models by >20% on math tasks?
Use reasoning models only for competition-level math (AIME, IMO) or proofs requiring >5 symbolic transformations; for standard calculus/algebra, GPT-4o with chain-of-thought prompting achieves >95% accuracy at 1/30th the cost.
Journey Context:
Teams often assume that all math work requires a reasoning model. Evals show otherwise: GPT-4o reaches ~92% on GSM8K versus ~95% for o1, a gap of about 3 points, yet it costs $0.50 versus $15 per 1k completions (a 30x difference). The cliff appears at the competition boundary: on AIME problems, o1 scores ~75% while GPT-4o scores ~12%. The signature to look for is 'multi-hop algebraic manipulation'. If the solution requires novel symbolic reasoning, a reasoning model justifies its cost; for template-based math, it wastes budget.
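The trade-off described above can be sketched as a small calculation. This is a rough illustration only, using the approximate accuracy and cost figures quoted in this report. The names (`BenchmarkResult`, `recommend`, `extra_cost_per_extra_correct`) are hypothetical, and reading the ">20%" threshold as 20 percentage points of absolute accuracy is an assumption, not something the report states.

```python
from dataclasses import dataclass

# Approximate figures quoted in this report; USD per 1k completions.
INSTRUCT_COST_PER_1K = 0.50    # GPT-4o
REASONING_COST_PER_1K = 15.00  # o1


@dataclass(frozen=True)
class BenchmarkResult:
    name: str
    instruct_acc: float   # fraction solved by the instruct model
    reasoning_acc: float  # fraction solved by the reasoning model


RESULTS = [
    BenchmarkResult("GSM8K", instruct_acc=0.92, reasoning_acc=0.95),
    BenchmarkResult("AIME", instruct_acc=0.12, reasoning_acc=0.75),
]


def accuracy_gap_points(r: BenchmarkResult) -> float:
    """Absolute accuracy gap in percentage points."""
    return round((r.reasoning_acc - r.instruct_acc) * 100, 1)


def extra_cost_per_extra_correct(r: BenchmarkResult) -> float:
    """Extra USD paid per additional correct answer from the reasoning model."""
    extra_correct = 1000 * (r.reasoning_acc - r.instruct_acc)
    if extra_correct <= 0:
        raise ValueError("reasoning model adds no correct answers")
    return (REASONING_COST_PER_1K - INSTRUCT_COST_PER_1K) / extra_correct


def recommend(r: BenchmarkResult, min_gap_points: float = 20.0) -> str:
    """Assumed rule: pay for reasoning only when the gap exceeds the threshold."""
    return "reasoning" if accuracy_gap_points(r) > min_gap_points else "instruct"


cost_ratio = REASONING_COST_PER_1K / INSTRUCT_COST_PER_1K

for r in RESULTS:
    print(
        f"{r.name}: gap={accuracy_gap_points(r)} pts, "
        f"extra cost per extra correct=${extra_cost_per_extra_correct(r):.3f}, "
        f"recommend={recommend(r)}"
    )
```

With the report's figures, the reasoning model costs roughly $0.48 per additional correct GSM8K answer but only about $0.02 per additional correct AIME answer, which is one way to see why the premium pays off only past the competition boundary. Replace the constants with your own eval results before relying on the recommendation.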
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:12:47.872100+00:00— report_created — created