Report #94132
[cost_intel] When does the 5-10x token cost of reasoning models reduce the cost-per-correct-solution in software engineering?
Use reasoning models for competitive programming (Codeforces Hard, LeetCode Hard) and complex refactoring, where they achieve 3-5x higher pass rates; use instruct models for boilerplate, CRUD generation, and simple bug fixes.
Journey Context:
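On Aider's code editing leaderboard and LiveCodeBench, o1 achieves 60-80% pass@1 on Hard tasks vs 15-25% for GPT-4o. At $15 vs $2.50 per 1M tokens, the report puts the cost-per-correct-answer at $25 (reasoning) vs $66 (instruct). The $25 figure matches $15 / 0.60 at equal token usage; the $66 figure does not follow from $2.50 and a 15-25% pass rate (which gives roughly $10-$17), so it likely rests on an unstated price or token count and should be rechecked. For Easy tasks, where both models exceed 85% accuracy, reasoning costs about 10x more per correct answer. The 'complexity cliff' is visible at roughly the 40% pass-rate threshold for instruct models: below this, reasoning becomes cost-efficient despite higher per-token pricing.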
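The cost comparison above can be reproduced with a small calculation. The following is a minimal sketch, not the report's own method: it assumes independent retries (so the expected number of attempts per correct answer is 1 / pass rate), a single blended price per 1M tokens, and a reliable way to tell a correct solution from an incorrect one. The function names and the token counts in the usage section are hypothetical; substitute measured values for your workload.

```python
def cost_per_correct(price_per_mtok: float, tokens_per_attempt: float,
                     pass_rate: float) -> float:
    """Expected spend (USD) per correct solution.

    Assumes independent attempts, so expected attempts = 1 / pass_rate.
    """
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    cost_per_attempt = price_per_mtok * tokens_per_attempt / 1_000_000
    return cost_per_attempt / pass_rate


def break_even_instruct_pass_rate(reasoning_price: float, reasoning_tokens: float,
                                  reasoning_pass_rate: float,
                                  instruct_price: float, instruct_tokens: float) -> float:
    """Instruct pass rate below which the reasoning model is cheaper per correct solution.

    A result above 1.0 means the reasoning model is cheaper at any instruct pass rate.
    """
    reasoning_cost = cost_per_correct(reasoning_price, reasoning_tokens, reasoning_pass_rate)
    instruct_cost_per_attempt = instruct_price * instruct_tokens / 1_000_000
    return instruct_cost_per_attempt / reasoning_cost


if __name__ == "__main__":
    # Prices are taken from the report; token counts are illustrative assumptions.
    reasoning = cost_per_correct(15.00, 1_000_000, 0.60)
    instruct = cost_per_correct(2.50, 1_000_000, 0.15)
    threshold = break_even_instruct_pass_rate(15.00, 1_000_000, 0.60, 2.50, 1_000_000)
    print(f"reasoning: ${reasoning:.2f} per correct solution")
    print(f"instruct:  ${instruct:.2f} per correct solution")
    print(f"break-even instruct pass rate: {threshold:.0%}")
```

Because the result depends on both per-token price and tokens per attempt, the break-even pass rate shifts with the 5-10x token overhead of reasoning models; it is worth recomputing with token counts logged from real runs rather than relying on a fixed threshold.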
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:35:16.735200+00:00— report_created — created