Report #56040
[cost_intel] When is o1-preview worth 30x the cost of Claude 3.5 Sonnet for coding tasks?
Reserve reasoning models for code that requires coordinating changes across more than 2 files, novel algorithmic design, or complex debugging with causal chains longer than 5 steps. For CRUD endpoints or unit tests, Claude 3.5 Sonnet with detailed spec prompting achieves 98% of the reasoning model's pass rate at 1/30th of the cost.
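The routing rule above can be sketched as a small function. This is a minimal illustration only: `CodingTask`, its fields, and `choose_model` are hypothetical names, the returned strings are plain labels rather than API model identifiers, and the thresholds are the ones quoted in this report.

```python
from dataclasses import dataclass


@dataclass
class CodingTask:
    """Hypothetical task description; fields mirror the criteria in this report."""
    files_touched: int        # number of files the change must coordinate
    novel_algorithm: bool     # True if the task needs new algorithmic design
    causal_chain_steps: int   # estimated steps in the debugging causal chain


def choose_model(task: CodingTask) -> str:
    """Return a model label using the report's thresholds (assumed, not tuned)."""
    needs_reasoning = (
        task.files_touched > 2
        or task.novel_algorithm
        or task.causal_chain_steps > 5
    )
    return "o1-preview" if needs_reasoning else "claude-3.5-sonnet"


# A single-file CRUD endpoint stays on the cheaper model.
crud_endpoint = CodingTask(files_touched=1, novel_algorithm=False, causal_chain_steps=0)
# A refactor spanning four files is routed to the reasoning model.
cross_file_refactor = CodingTask(files_touched=4, novel_algorithm=False, causal_chain_steps=2)
```

Estimating `causal_chain_steps` up front is itself a judgment call, so in practice the field would likely be a rough guess supplied by whoever triages the ticket.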
Journey Context:
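SWE-bench Verified shows o1-preview at a 48% solve rate vs. Claude 3.5 Sonnet at 23%, but on HumanEval the gap narrows to 95% vs. 92%. The latency cliff: o1-preview takes 30-60s per response vs. Sonnet's 3-5s, which makes it unusable for real-time IDE autocomplete. Cost per correct solution on SWE-bench: the >2x success-rate gap offsets only part of the 30x per-call cost. Assuming one call per attempt, o1-preview still costs roughly 14x more per solved ticket at these figures (30 / 0.48 vs. 1 / 0.23), so it comes out cheaper only when a failed attempt carries costs beyond API spend, such as engineer time on an unsolved ticket. Signature to watch: if the task fits in a single file and uses common libraries, skip the reasoning model.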
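As a sanity check on the cost-per-solved-ticket comparison, the figures quoted in this report (30x per-call cost, 48% vs. 23% solve rate) can be plugged into a quick calculation. The sketch assumes one call per attempt and counts API spend only; `cost_per_solved` and the normalized cost units are illustrative, not real prices.

```python
def cost_per_solved(cost_per_call: float, solve_rate: float) -> float:
    """Expected API spend per solved ticket, assuming one call per attempt."""
    if not 0 < solve_rate <= 1:
        raise ValueError("solve_rate must be in (0, 1]")
    return cost_per_call / solve_rate


# Normalized costs: Sonnet = 1 unit per call, o1-preview = 30 units per call.
SONNET_COST, O1_COST = 1.0, 30.0
SONNET_RATE, O1_RATE = 0.23, 0.48  # solve rates as quoted in this report

sonnet = cost_per_solved(SONNET_COST, SONNET_RATE)  # about 4.35 units
o1 = cost_per_solved(O1_COST, O1_RATE)              # 62.5 units
ratio = o1 / sonnet                                 # about 14.4x

# Per-call cost multiple below which o1-preview would be cheaper per solve.
break_even_multiple = O1_RATE / SONNET_RATE         # about 2.09x

print(f"o1-preview costs {ratio:.1f}x more per solved ticket")
print(f"break-even per-call multiple: {break_even_multiple:.2f}x")
```

On API spend alone, the success-rate gap would justify at most about a 2x price premium. Any case for the reasoning model at 30x therefore has to rest on costs this model leaves out, such as the engineer time consumed by a ticket that stays unsolved.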
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:33:22.790003+00:00— report_created — created