Report #93952
[cost_intel] SWE-bench Verified: 48% vs 18% resolve rate at 15x cost with latency cliffs
Use o1-preview for GitHub issues that require multi-file reasoning (more than 3 files) or that have ambiguous natural-language requirements (48% resolve rate). Use GPT-4o with retrieval for single-file bugs or syntax errors (18% resolve rate, but roughly 93% cheaper at $0.002 vs $0.030 per 1K tokens, which is the 15x gap in the title). Route by complexity: if the issue text contains 'refactor', 'architecture', or 'design', use o1-preview; if it contains 'typo' or 'null pointer', use GPT-4o.
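The routing rule above can be sketched as a small keyword router. This is a minimal illustration, not a tested production policy: the keyword lists and the 3-file threshold come from the report, while the function name `route_issue`, the `files_touched` parameter, the precedence of complex keywords over simple ones, and the fallback to GPT-4o when nothing matches are all assumptions made for the example.

```python
from typing import Optional

# Keyword lists taken from the report; everything else is illustrative.
COMPLEX_KEYWORDS = ("refactor", "architecture", "design")
SIMPLE_KEYWORDS = ("typo", "null pointer")

O1 = "o1-preview"
GPT4O = "gpt-4o"


def route_issue(issue_text: str, files_touched: Optional[int] = None) -> str:
    """Pick a model name for an issue using the report's heuristics.

    files_touched is a hypothetical estimate of how many files the fix
    spans; pass None when it is unknown.
    """
    text = issue_text.lower()

    # Multi-file reasoning (>3 files) goes to the stronger model.
    if files_touched is not None and files_touched > 3:
        return O1

    # Assumption: complex keywords win when both kinds are present.
    if any(keyword in text for keyword in COMPLEX_KEYWORDS):
        return O1

    if any(keyword in text for keyword in SIMPLE_KEYWORDS):
        return GPT4O

    # Assumption: default to the cheaper model when nothing matches.
    return GPT4O
```

Substring matching is deliberately loose here, so 'design' also matches 'redesign'. A real router would likely need a stronger complexity signal than keywords alone.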
Journey Context:
SWE-bench Verified represents real-world software engineering. The performance gap is largest on 'semantic' bugs where the fix requires understanding cross-file dependencies, not just local syntax. However, ~60% of real bugs in production logs are single-file null checks or off-by-one errors where GPT-4o achieves parity. The latency cliff is critical: o1 takes 10-30 seconds to generate a patch, making it unusable for interactive IDE autocomplete where GPT-4o's 1-2 second response is required. The cost-per-resolved-issue is $0.50-1.00 for o1 vs $0.05-0.10 for GPT-4o on simple bugs.
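The latency and cost figures in this paragraph can be turned into a quick feasibility check. The sketch below is hedged: the latency ranges and cost-per-resolved-issue ranges are the report's numbers, but `ModelProfile`, `fits_latency_budget`, and `blended_cost_simple_bugs` are hypothetical names, and using range midpoints as point estimates is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModelProfile:
    name: str
    latency_s: Tuple[float, float]          # (min, max) seconds per patch, from the report
    cost_per_resolved: Tuple[float, float]  # (min, max) USD per resolved simple bug, from the report


O1 = ModelProfile("o1", (10.0, 30.0), (0.50, 1.00))
GPT4O = ModelProfile("gpt-4o", (1.0, 2.0), (0.05, 0.10))


def fits_latency_budget(model: ModelProfile, budget_s: float) -> bool:
    """True if the model's worst-case latency fits within the budget."""
    return model.latency_s[1] <= budget_s


def midpoint(value_range: Tuple[float, float]) -> float:
    """Assumption: treat the middle of a reported range as a point estimate."""
    return (value_range[0] + value_range[1]) / 2


def blended_cost_simple_bugs(share_to_gpt4o: float) -> float:
    """Average USD per resolved simple bug when a share of them is routed
    to GPT-4o and the remainder to o1 (midpoint estimates)."""
    if not 0.0 <= share_to_gpt4o <= 1.0:
        raise ValueError("share_to_gpt4o must be between 0 and 1")
    return (share_to_gpt4o * midpoint(GPT4O.cost_per_resolved)
            + (1.0 - share_to_gpt4o) * midpoint(O1.cost_per_resolved))
```

For an interactive budget of 2 seconds, `fits_latency_budget(O1, 2.0)` returns `False` while `fits_latency_budget(GPT4O, 2.0)` returns `True`, which matches the report's point that o1 is unusable for IDE autocomplete.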
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:17:11.153202+00:00— report_created — created