Report #69085
[cost_intel] Using o1 uniformly for all SWE-bench tasks costs $40-60 per task, even though 40% are solvable by GPT-4o at $0.50
Route tasks to o1 only when the bug spans >2 files or requires >50 lines of architectural change; use GPT-4o with retrieval for localized single-file bugs
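The rule above can be sketched as a small function. This is a minimal illustration, not a tested router: `BugEstimate` and `choose_model` are hypothetical names, the model labels are plain strings and not API identifiers, and the sketch assumes you can estimate the scope of a fix before routing (for example from a retrieval or localization step).

```python
from dataclasses import dataclass

# Thresholds copied from the recommendation above.
MAX_FILES_FOR_CHEAP_MODEL = 2
MAX_LINES_FOR_CHEAP_MODEL = 50


@dataclass(frozen=True)
class BugEstimate:
    """Hypothetical pre-routing estimate of how large the fix will be."""
    files_touched: int
    lines_changed: int


def choose_model(estimate: BugEstimate) -> str:
    """Return "o1" for wide or large changes, "gpt-4o" otherwise."""
    if estimate.files_touched > MAX_FILES_FOR_CHEAP_MODEL:
        return "o1"
    if estimate.lines_changed > MAX_LINES_FOR_CHEAP_MODEL:
        return "o1"
    return "gpt-4o"
```

Usage: `choose_model(BugEstimate(files_touched=1, lines_changed=12))` selects the cheaper model, while an estimate of 3 files or 51 lines selects o1. Both thresholds are strict, so exactly 2 files or exactly 50 lines still routes to GPT-4o.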
Journey Context:
SWE-bench Verified analysis shows o1-preview achieves a ~48% solve rate vs GPT-4o's ~33%, but the cost per attempt differs starkly: o1 averages $40-60 per attempt due to reasoning token volume, while 4o costs $0.50-$2. The critical differentiator is 'complexity depth': for single-file bugs with clear stack traces (<20 lines changed), 4o with RAG matches o1's accuracy (both ~80%) at roughly 1/50th of the cost. o1's advantage emerges only in multi-file PRs requiring cross-file reasoning. The failure mode is using o1 as a default code fixer, which is economically irrational for 'easy' bugs. Implement a router: if the issue mentions multiple files or 'refactor', use o1; otherwise, use 4o.
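A minimal sketch of the text-based router described above, plus the cost-per-solve arithmetic behind it. Everything named here is an assumption for illustration: `route_issue`, `cost_per_solve`, and the file-extension regex are hypothetical, and "mentions multiple files" is approximated by counting distinct path-like tokens in the issue text.

```python
import re

# Hypothetical heuristic: path-like tokens ending in a common source extension,
# e.g. "src/utils/io.py" or "parser.c".
FILE_PATTERN = re.compile(r"\b[\w./-]+\.(?:py|js|ts|go|java|rb|c|cpp|h|rs)\b")


def route_issue(issue_text: str) -> str:
    """Return "o1" if the issue names more than one file or mentions a refactor."""
    files = {match.group(0) for match in FILE_PATTERN.finditer(issue_text)}
    mentions_refactor = "refactor" in issue_text.lower()
    return "o1" if len(files) > 1 or mentions_refactor else "gpt-4o"


def cost_per_solve(cost_per_attempt: float, solve_rate: float) -> float:
    """Expected spend per solved task, assuming one independent attempt per task."""
    if not 0 < solve_rate <= 1:
        raise ValueError("solve_rate must be in (0, 1]")
    return cost_per_attempt / solve_rate


# Midpoints of the per-attempt ranges and the overall solve rates quoted above.
o1_cost = cost_per_solve(50.0, 0.48)
gpt4o_cost = cost_per_solve(1.25, 0.33)
```

With those midpoints, the expected spend per solved task works out to roughly $104 for o1 and under $4 for GPT-4o. Treat this as a rough comparison only: the two solve rates are benchmark-wide figures, and the tasks each model solves are not the same set.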
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:26:27.664587+00:00 — report_created — created