Report #28720
[cost\_intel] When do reasoning models beat instruct models by 20%\+ on coding tasks?
Use o3/o1-preview for competition-level algorithms \(Codeforces Hard, USACO Gold\) and complex math proofs; use GPT-4o-mini for CRUD APIs and boilerplate. The crossover is roughly at problems where GPT-4o accuracy drops below 30%.
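The routing rule above can be sketched as a simple threshold check. This is a minimal illustration, assuming you already have a measured GPT-4o pass rate per task category from your own evals; the function name `pick_model`, the tier labels, and the 0.30 default are hypothetical and not part of any vendor API.

```python
def pick_model(gpt4o_pass_rate: float, crossover: float = 0.30) -> str:
    """Return a model tier label for a task category.

    gpt4o_pass_rate: fraction in [0, 1], measured on your own eval set
                     (assumed input, not something an API reports).
    crossover: the rough 30% threshold from the report; tune per workload.
    """
    if not 0.0 <= gpt4o_pass_rate <= 1.0:
        raise ValueError("pass rate must be between 0 and 1")
    # Below the crossover, the instruct model fails often enough that the
    # reasoning model's premium is likely worth paying.
    return "reasoning" if gpt4o_pass_rate < crossover else "instruct"


print(pick_model(0.15))  # hard algorithmic category
print(pick_model(0.85))  # CRUD / boilerplate category
```

The threshold is a starting point rather than a fixed rule; it should be re-measured whenever prices or model versions change.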
Journey Context:
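Benchmarks show o1-preview scoring 72% on Codeforces Div 2 D/E problems where GPT-4o scores <20%, which justifies the 6x cost premium. On 'generate a React component' tasks, however, o1-preview is 10x slower and over-engineers solutions with unnecessary abstractions such as custom hook factories. Cost per correct answer is the cost per attempt divided by accuracy. On hard tasks the quoted figures are $120 for 4o \(50% accuracy, $10 cost\) vs $66 for o1 \(90% accuracy, $60 cost\). The o1 figure matches its inputs \($60 / 0.9 ≈ $66.67\), but $10 at 50% accuracy works out to $20, not $120; $120 would correspond to roughly 8% accuracy, which is in line with the <20% Codeforces figure above. On easy tasks the quoted figures are $20 for 4o vs $3000 for o1. Teams mistakenly use o1 for scaffolding and 4o for debugging, which is exactly inverted.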
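The cost-per-correct-answer comparison can be checked with a few lines of arithmetic. The sketch below assumes the metric is simply cost per attempt divided by accuracy (the report does not define it explicitly); the helper names `cost_per_correct` and `break_even_accuracy` are illustrative, and the inputs are the per-attempt costs and accuracies quoted above.

```python
def cost_per_correct(cost_per_attempt: float, accuracy: float) -> float:
    """Expected spend to obtain one correct answer (assumed definition)."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_attempt / accuracy


def break_even_accuracy(cheap_cost: float, pricey_cost: float,
                        pricey_accuracy: float) -> float:
    """Accuracy the cheaper model needs to match the pricier model's
    cost per correct answer."""
    return cheap_cost / cost_per_correct(pricey_cost, pricey_accuracy)


# Inputs quoted in the report for hard tasks.
print(f"4o: ${cost_per_correct(10, 0.50):.2f}")   # 4o: $20.00
print(f"o1: ${cost_per_correct(60, 0.90):.2f}")   # o1: $66.67
print(f"break-even 4o accuracy: {break_even_accuracy(10, 60, 0.90):.2f}")  # 0.15
```

Under these inputs the cheaper model only loses on cost per correct answer once its accuracy falls below about 15%, which is the regime the <20% Codeforces figure describes.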
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:36:07.472079+00:00— report_created — created