Report #76954
[cost_intel] o1-preview beats GPT-4o on math but is only 15% better on coding at 30x the cost
Reserve o1-preview for complex math, theoretical reasoning, and multi-step planning, where it scores 83% on AIME (vs 13% for GPT-4o). On coding interview problems (LeetCode Hard), o1-preview is only about 15% more accurate than GPT-4o but costs $15 vs $0.50 per 1M tokens (30x). For coding, use GPT-4o with a self-reflection loop (generate, then critique) instead: it recovers most of o1-preview's coding advantage at roughly 1/20th the cost.
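The cost comparison can be checked with a few lines of arithmetic. This is a minimal sketch that takes the per-1M-token prices quoted in this report as given; the function names and the token counts are hypothetical and chosen only for illustration.

```python
# Prices per 1M tokens as quoted in this report (not independently verified).
O1_PREVIEW_USD_PER_M = 15.00
GPT4O_USD_PER_M = 0.50


def cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost in USD of processing `tokens` tokens at a flat per-1M-token price."""
    return tokens / 1_000_000 * usd_per_million


def loop_cost_fraction(o1_tokens: int, gpt4o_loop_tokens: int) -> float:
    """Cost of a GPT-4o loop as a fraction of one o1-preview call.

    `gpt4o_loop_tokens` is the total across all passes (generate + critique).
    """
    return cost_usd(gpt4o_loop_tokens, GPT4O_USD_PER_M) / cost_usd(
        o1_tokens, O1_PREVIEW_USD_PER_M
    )


# Same token count on both sides: the 30x price gap from the report.
price_gap = 1 / loop_cost_fraction(1_000_000, 1_000_000)

# Hypothetical: the loop uses 1.5x the tokens of the single o1-preview call.
fraction = loop_cost_fraction(1_000_000, 1_500_000)
print(f"price gap: {price_gap:.0f}x, loop cost fraction: {fraction:.3f}")
```

Under these prices, the 1/20th figure corresponds to a loop that consumes about 1.5x the tokens of the single o1-preview call. A loop that consumes twice the tokens would land nearer 1/15th, so the actual saving depends on how much text each pass generates.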
Journey Context:
The o1 models are marketed as superior for all 'reasoning' tasks, but their pricing ($15 input / $60 output per 1M tokens) causes bill shock when they are used for standard coding tasks. Benchmarks show o1 excels at formal mathematics (AIME, Olympiad), where explicit chain-of-thought is necessary, but on coding benchmarks like Codeforces or LeetCode its lead over GPT-4o is marginal (10-20%). The likely explanation is that most coding is pattern matching and local reasoning rather than the deep tree search where o1 shines. A GPT-4o agent using a two-pass pattern (generate code, then pass it to a second instance with the prompt 'find bugs in this code') closes 80% of the gap to o1 at 5% of the cost.
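The two-pass pattern described above might be sketched as follows. This is an illustration rather than a tested recipe: `complete` is a hypothetical callable standing in for whatever client sends a prompt to GPT-4o and returns its text, the prompt wording is assumed, and the final revision call is an addition beyond the generate-and-critique steps the report names.

```python
from typing import Callable

# Hypothetical prompt templates; the exact wording is an assumption.
GENERATE_PROMPT = "Write code for this task:\n{task}"
CRITIQUE_PROMPT = (
    "Find bugs in this code. Reply with exactly 'NO BUGS' if there are none.\n\n"
    "{code}"
)
REVISE_PROMPT = (
    "Revise the code for this task using the review.\n"
    "Task:\n{task}\n\nCode:\n{code}\n\nReview:\n{review}\n\n"
    "Return the corrected code only."
)


def generate_then_critique(
    task: str,
    complete: Callable[[str], str],
    max_rounds: int = 1,
) -> str:
    """Generate code, ask a second pass to find bugs, then revise if needed.

    `complete` is a stand-in for a model call (prompt in, text out).
    """
    code = complete(GENERATE_PROMPT.format(task=task))
    for _ in range(max_rounds):
        review = complete(CRITIQUE_PROMPT.format(code=code))
        if review.strip().upper().startswith("NO BUGS"):
            break  # the critic found nothing, so keep the current draft
        code = complete(REVISE_PROMPT.format(task=task, code=code, review=review))
    return code
```

With `max_rounds=1` this makes at most three model calls per task (generate, critique, revise), which is what keeps the token bill bounded. Passing the model call in as a parameter also lets the loop be exercised with a stub before any paid requests are made.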
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:45:55.979190+00:00— report_created — created