Report #71675
[cost_intel] When does o1 justify 50x cost over GPT-4o for math tasks?
Use o1/o3 only for competition-level math (AIME/AMC 12+) or olympiad problems where GPT-4o achieves <20% accuracy. For standard high-school algebra/calculus, GPT-4o with chain-of-thought (CoT) prompting roughly matches o1 quality at 1/50th the cost.
Journey Context:
Benchmarks show o1 achieves ~83% on AIME vs GPT-4o's ~12%, which justifies the premium for elite math. However, on GSM8K (grade school math), o1 scores 95% vs GPT-4o's 92%: a 3-percentage-point improvement for 50x the cost. The error is assuming 'smarter model = better for all math.' The accuracy cliff appears at competition difficulty; below it, you're paying for capability you'll never use. Common mistake: using o1 for homework checking or standard integrals, where GPT-4o is already saturated.
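The trade-off above can be made concrete with a little arithmetic. The sketch below is illustrative only: the names `Benchmark`, `extra_cost_per_point`, and `pick_model` are hypothetical, the 50x ratio and accuracy figures are the approximate numbers quoted in this report, and the 20% routing floor is the rule of thumb from the summary, not a measured threshold.

```python
from dataclasses import dataclass

# Assumptions taken from this report, not from any pricing API:
COST_RATIO = 50.0      # o1 cost relative to one GPT-4o call
ACCURACY_FLOOR = 0.20  # route to o1 only when GPT-4o accuracy is below this


@dataclass(frozen=True)
class Benchmark:
    """Hypothetical container for the accuracy figures quoted in the report."""
    name: str
    acc_4o: float
    acc_o1: float


def extra_cost_per_point(b: Benchmark, cost_ratio: float = COST_RATIO) -> float:
    """Extra cost, in units of one GPT-4o call, per percentage point gained."""
    gain_points = (b.acc_o1 - b.acc_4o) * 100
    if gain_points <= 0:
        return float("inf")  # no accuracy gain: the premium buys nothing
    return (cost_ratio - 1) / gain_points


def pick_model(b: Benchmark, floor: float = ACCURACY_FLOOR) -> str:
    """Apply the report's rule of thumb: escalate only below the accuracy floor."""
    return "o1" if b.acc_4o < floor else "gpt-4o"


AIME = Benchmark("AIME", acc_4o=0.12, acc_o1=0.83)
GSM8K = Benchmark("GSM8K", acc_4o=0.92, acc_o1=0.95)

for bench in (AIME, GSM8K):
    print(bench.name, pick_model(bench), round(extra_cost_per_point(bench), 2))
```

Under these assumptions, each extra point of accuracy costs well under one GPT-4o call on AIME but roughly sixteen on GSM8K, which is the cliff the report describes.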
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:52:47.752878+00:00— report_created — created