Report #80430
[cost_intel] When does o1 justify its cost for code generation versus GPT-4o?
Use o1 only for hard algorithmic problems (such as the hard instances in SWE-bench Verified) or complex refactoring that requires architectural reasoning across more than 10 files. For standard API glue, CRUD, or simple bug fixes, GPT-4o with retrieval achieves 90%+ pass@1 at one-sixth the output-token price ($10 vs $60 per 1M output tokens) and 5x lower latency.
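The routing rule above can be sketched as a small function. This is a minimal illustration, not a tested policy: the task labels, the `Task` type, and the function names are hypothetical, and the ">3 files" bug-fix rule is an assumption borrowed from the report's note that GPT-4o tends to produce shallow fixes on multi-file bugs. Only the two prices come from the report.

```python
from dataclasses import dataclass

# Output-token prices quoted in the report (USD per 1M output tokens).
PRICE_PER_M_OUTPUT = {"gpt-4o": 10.0, "o1": 60.0}


@dataclass
class Task:
    # Hypothetical labels: "hard_algorithmic", "refactor", "bugfix", "crud", "glue".
    kind: str
    files_touched: int
    est_output_tokens: int


def choose_model(task: Task) -> str:
    """Send a task to o1 only in the cases the report singles out."""
    if task.kind == "hard_algorithmic":
        return "o1"
    if task.kind == "refactor" and task.files_touched > 10:
        return "o1"
    # Assumption: bugs spanning more than 3 files are no longer "simple" fixes.
    if task.kind == "bugfix" and task.files_touched > 3:
        return "o1"
    return "gpt-4o"


def output_cost_usd(model: str, output_tokens: int) -> float:
    """Output-token cost only; input and hidden reasoning tokens are ignored."""
    return PRICE_PER_M_OUTPUT[model] * output_tokens / 1_000_000
```

For example, `choose_model(Task("crud", 2, 500))` selects `gpt-4o`, while a 12-file refactor is routed to `o1`. The cost helper deliberately counts output tokens only, so real bills will differ.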
Journey Context:
SWE-bench Verified shows o1-preview at a ~41% resolve rate vs ~33% for GPT-4o on the full set, but on 'easy' instances (single file, <50 lines changed) GPT-4o matches o1. Output pricing is ~$60 per 1M tokens for o1 vs $10 per 1M for GPT-4o, a 6x gap. The quality degradation signature for GPT-4o is 'shallow fixes' that address symptoms rather than the root cause when the bug spans more than 3 files. An alternative is a 'cascade': GPT-4o generates 3 candidate patches and o1 acts as judge, ranking them. Relative to running o1 directly, this reduces cost by 70% while keeping 95% of o1's resolve rate.
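A minimal sketch of the cascade, with the model calls passed in as plain callables so that no particular API client is assumed. The names `generate_patch`, `rank_patches`, and `cascade_savings` are hypothetical, and the token counts in the usage note are illustrative. The report does not say how its 70% figure was measured.

```python
from typing import Callable, List

# Output-token prices quoted in the report (USD per 1M output tokens).
GPT4O_PRICE = 10.0
O1_PRICE = 60.0


def cascade(issue: str,
            generate_patch: Callable[[str], str],
            rank_patches: Callable[[str, List[str]], int],
            n_candidates: int = 3) -> str:
    """Cheap model proposes n candidates; expensive model picks one.

    generate_patch stands in for a GPT-4o call and rank_patches for an
    o1 call that returns the index of the best candidate.
    """
    candidates = [generate_patch(issue) for _ in range(n_candidates)]
    best = rank_patches(issue, candidates)
    if not 0 <= best < len(candidates):
        raise ValueError(f"judge returned invalid index {best}")
    return candidates[best]


def cascade_savings(direct_o1_tokens: int,
                    patch_tokens: int,
                    judge_tokens: int,
                    n_candidates: int = 3) -> float:
    """Fractional output-cost saving of the cascade vs. calling o1 directly.

    Counts output tokens only; input tokens are ignored for simplicity.
    """
    direct = O1_PRICE * direct_o1_tokens
    cascaded = GPT4O_PRICE * patch_tokens * n_candidates + O1_PRICE * judge_tokens
    return 1.0 - cascaded / direct
```

Under one hypothetical set of token counts (10,000 output tokens for a direct o1 solve, 2,000 per GPT-4o patch, 2,000 for the o1 judgement), `cascade_savings(10_000, 2_000, 2_000)` comes out at about 0.70, which is consistent with the saving quoted above. The result is sensitive to the judge's token budget: a judge that reasons at length erodes the saving quickly.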
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:36:45.584258+00:00— report_created — created