Report #96913
[cost_intel] When do reasoning models justify 20x+ cost for mathematical tasks?
Use o1/o3-level models only for competition-level math (AIME/AMC 12+) that requires multi-step verification; use GPT-4o for standard algebra and calculus.
Journey Context:
Benchmarks show o1 at 83% on AIME versus 13% for GPT-4o, a 70-point gap that justifies the 50x cost. On standard MATH dataset problems (high school level), however, GPT-4o reaches 78% versus 85% for o1, so the 7-point gain means paying $15 instead of $0.30 per 1k problems. The cliff appears once a problem needs more than 5 verification steps. Common error: using reasoning models for 'show your work' high school homework, where GPT-4o's chain-of-thought is sufficient.
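One way to make the trade-off concrete is to compute the extra spend per additional problem solved. The sketch below uses only the figures quoted in this report. It assumes, for illustration, that the $15 and $0.30 per-1k prices apply to both benchmarks, which the report states only for MATH. The names `Benchmark`, `cost_per_extra_solve` and `pick_tier` are hypothetical helpers, not part of any vendor API, and the step count passed to `pick_tier` is an estimate you would have to supply yourself.

```python
from dataclasses import dataclass

# Prices per 1k problems as quoted in this report (not independently verified).
STANDARD_COST_PER_1K = 0.30    # GPT-4o
REASONING_COST_PER_1K = 15.00  # o1

# Threshold from the report: the gap opens above 5 verification steps.
VERIFICATION_STEP_THRESHOLD = 5


@dataclass(frozen=True)
class Benchmark:
    name: str
    standard_pct: int   # GPT-4o accuracy, in percent
    reasoning_pct: int  # o1 accuracy, in percent


def cost_per_extra_solve(bench: Benchmark) -> float:
    """Extra dollars paid per additional problem the reasoning model solves."""
    # Each percentage point is 10 problems out of 1,000.
    extra_solved_per_1k = (bench.reasoning_pct - bench.standard_pct) * 10
    if extra_solved_per_1k <= 0:
        raise ValueError("reasoning model shows no accuracy gain on this benchmark")
    extra_cost_per_1k = REASONING_COST_PER_1K - STANDARD_COST_PER_1K
    return extra_cost_per_1k / extra_solved_per_1k


def pick_tier(estimated_verification_steps: int) -> str:
    """Route to the reasoning tier only above the step threshold."""
    if estimated_verification_steps > VERIFICATION_STEP_THRESHOLD:
        return "reasoning"
    return "standard"


AIME = Benchmark("AIME", standard_pct=13, reasoning_pct=83)
MATH = Benchmark("MATH", standard_pct=78, reasoning_pct=85)

if __name__ == "__main__":
    for bench in (AIME, MATH):
        print(f"{bench.name}: ${cost_per_extra_solve(bench):.3f} per extra solved problem")
```

Under these assumptions each extra correct answer costs about $0.021 on AIME and about $0.21 on MATH, a tenfold difference in marginal cost that matches the report's recommendation to reserve reasoning models for competition-level problems.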
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:15:01.228690+00:00— report_created — created