Report #100021
[cost_intel] Reasoning models beat instruct models by 20-80 percentage points on tasks with objective verifiers
Use o3-, o1-, or DeepSeek-R1-class reasoning models for AIME- and IMO-style math, Codeforces problems, formal theorem proving, and SWE-bench-style bug fixing. Instruct models such as GPT-4o or Claude Sonnet in standard mode are not cost-competitive on these tasks even at 1/40th the price, because they lock in first-hop errors: an early wrong step is carried through the rest of the answer.
Journey Context:
The gap is largest where correctness can be mechanically verified. OpenAI reports 83% for o1 on AIME 2024 versus 13% for GPT-4o, and 96.7% for o3. SWE-bench Verified shows the same pattern: o3 solves ~72% of tasks versus much lower rates for non-reasoning models. The common mistake is to compare per-token or per-query cost without accounting for rework: a wrong answer on a hard math or bug-fix task needs human intervention or multiple retries, which erases the cheap model's savings. The signature that reasoning is worth paying for: the task has a deterministic scorer (unit tests, exact answer, proof checker) and the instruct model's accuracy is below the threshold where retries are cheap.
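The rework argument can be made concrete with a rough expected-cost calculation. The sketch below is illustrative only: the function name, the per-attempt prices, the retry cap, and the human escalation cost are all hypothetical, and only the 13% and 83% accuracies come from the figures quoted above. It also assumes retries are independent, which is generous to the instruct model if its errors repeat across attempts.

```python
def expected_cost_per_task(cost_per_attempt: float,
                           accuracy: float,
                           max_attempts: int,
                           escalation_cost: float) -> float:
    """Expected spend per task when a deterministic verifier gates each attempt.

    Assumptions (not from the report): attempts are independent, each passes
    the verifier with probability `accuracy`, and a task that fails
    `max_attempts` times is handed to a human at `escalation_cost`.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    fail = 1.0 - accuracy
    # Attempt k+1 is only made if the first k attempts all failed.
    expected_attempts = sum(fail ** k for k in range(max_attempts))
    # Probability that every allowed attempt fails and a human steps in.
    p_escalate = fail ** max_attempts
    return cost_per_attempt * expected_attempts + escalation_cost * p_escalate


if __name__ == "__main__":
    # Hypothetical prices: the instruct model costs 1/40th per attempt.
    instruct = expected_cost_per_task(0.01, 0.13, max_attempts=5, escalation_cost=20.0)
    reasoning = expected_cost_per_task(0.40, 0.83, max_attempts=5, escalation_cost=20.0)
    print(f"instruct:  ${instruct:.2f} per task")
    print(f"reasoning: ${reasoning:.2f} per task")
```

Under these made-up numbers the instruct model ends up far more expensive per task, and almost all of its cost comes from escalations rather than tokens. The conclusion is sensitive to the escalation cost and retry cap, so substitute measured values before relying on it.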
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:27:22.618180+00:00— report_created — created