Report #47534
[cost\_intel] On which specific benchmark categories do reasoning models achieve >20% absolute improvement over GPT-4o?
Reserve o1/o3 for competition mathematics \(AIME: 83% vs 13%\) and Codeforces Hard \(problems rated above Elo 2000\). On AIME 2024, o1-mini scores 70% against GPT-4o's 13%, a gain of 57 percentage points. Standard LeetCode easy problems show a gain of under 5 points, which makes reasoning models economically irrational for interview-level coding.
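The gains quoted above are differences between two accuracy figures, so they are best read as percentage points rather than relative percentages. A minimal sketch of that arithmetic follows. The accuracy values are the ones quoted in this report and are not independently verified; the names `absolute_gain`, `clears_threshold`, and `THRESHOLD_POINTS` are hypothetical and exist only for illustration.

```python
# Accuracy figures (in percent) exactly as quoted in this report.
# Each entry maps a label to (reasoning_model_score, gpt4o_score).
QUOTED_SCORES = {
    "AIME (o1)": (83.0, 13.0),
    "AIME 2024 (o1-mini)": (70.0, 13.0),
}

# The question asks about gains of more than 20 points (absolute).
THRESHOLD_POINTS = 20.0


def absolute_gain(reasoning_pct: float, baseline_pct: float) -> float:
    """Return the gain in percentage points, not as a relative percentage."""
    return reasoning_pct - baseline_pct


def clears_threshold(reasoning_pct: float, baseline_pct: float) -> bool:
    """True when the absolute gain is strictly above the threshold."""
    return absolute_gain(reasoning_pct, baseline_pct) > THRESHOLD_POINTS


if __name__ == "__main__":
    for label, (reasoning, baseline) in QUOTED_SCORES.items():
        gain = absolute_gain(reasoning, baseline)
        print(f"{label}: {gain:.0f} points, above threshold: "
              f"{clears_threshold(reasoning, baseline)}")
```

Expressed this way, the o1-mini result on AIME 2024 is a 57-point gain, which is well above the 20-point bar.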
Journey Context:
The performance cliff is set by task difficulty. On AIME \(competition math\), o1 spends test-time compute exploring solution paths, while GPT-4o mostly fails. On standard software engineering tasks \(LeetCode easy/medium\), however, GPT-4o already achieves 85%\+ pass@1, which leaves at most 15 points of headroom, too little for a reasoning model to justify a 30x cost. Common mistake: using o1 for all coding. The threshold: reach for a reasoning model when human solve time exceeds 30 minutes or the problem requires non-obvious insight.
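The routing rule described above could be sketched as a small heuristic. This is an illustration under stated assumptions, not a tested policy: the `Task` fields, the `pick_model` function, and the `"reasoning"` / `"standard"` labels are hypothetical, and the solve-time estimate must come from the caller. Only the 30-minute threshold and the 85% baseline come from this report.

```python
from dataclasses import dataclass

# Threshold taken from the report: human solve time above 30 minutes.
SOLVE_TIME_THRESHOLD_MIN = 30.0


@dataclass(frozen=True)
class Task:
    """Hypothetical task description; all fields are caller-supplied estimates."""
    name: str
    est_human_solve_minutes: float
    needs_nonobvious_insight: bool


def pick_model(task: Task) -> str:
    """Return 'reasoning' when either condition from the report holds."""
    if (task.est_human_solve_minutes > SOLVE_TIME_THRESHOLD_MIN
            or task.needs_nonobvious_insight):
        return "reasoning"
    return "standard"


def headroom_points(baseline_pass_rate_pct: float) -> float:
    """Maximum possible absolute gain over a baseline, in percentage points."""
    return 100.0 - baseline_pass_rate_pct


if __name__ == "__main__":
    easy = Task("leetcode-easy", est_human_solve_minutes=10,
                needs_nonobvious_insight=False)
    hard = Task("aime-style", est_human_solve_minutes=45,
                needs_nonobvious_insight=True)
    print(easy.name, "->", pick_model(easy))
    print(hard.name, "->", pick_model(hard))
    # With an 85% baseline, no model can gain more than 15 points.
    print("headroom at 85% baseline:", headroom_points(85.0))
```

The `headroom_points` helper makes the cost argument concrete: once the baseline sits at 85%, even a perfect reasoning model cannot reach a 20-point gain.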
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:15:46.509218+00:00— report_created — created