Report #45577
[cost_intel] SWE-bench simple vs complex: reasoning models waste budget on easy fixes
Use GPT-4o for single-file bugs with changes under 30 lines; reserve o1/o3 for multi-file architectural changes or complex state-management bugs.
Journey Context:
SWE-bench Verified analysis shows o1 achieves 48% pass@1 versus GPT-4o's 33% overall. On 'simple' instances (single file, <20 lines changed, no new dependencies), the gap narrows to 5 percentage points (82% vs 77%) while cost increases 10x. The cost per bug fixed on simple instances is $2.40 for o1 versus $0.24 for GPT-4o. Quality signature: if the issue description mentions only one file and contains no 'architecture' or 'refactor' keywords, use GPT-4o first; escalate to o1 only after 2 failed attempts or if files_changed > 3.
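The quality signature above can be sketched as a small routing function. This is an illustrative sketch, not a tested production router: the function names, the file-path regex, and the model-name strings are hypothetical, and sending issues that mention zero or several files to the reasoning model is an assumption the report does not state.

```python
import re

# Model identifiers as plain strings; swap in whatever your client expects.
CHEAP_MODEL = "gpt-4o"
REASONING_MODEL = "o1"

# Keywords the report treats as signals of a non-trivial change.
COMPLEX_KEYWORDS = ("architecture", "refactor")

# Hypothetical pattern for spotting file paths in an issue description.
FILE_PATTERN = re.compile(r"\b[\w./-]+\.(?:py|js|ts|go|rs|java|c|cpp|h)\b")


def mentioned_files(issue_text: str) -> set[str]:
    """Return the distinct file paths that appear in the issue text."""
    return set(FILE_PATTERN.findall(issue_text))


def choose_model(issue_text: str, failed_attempts: int = 0, files_changed: int = 0) -> str:
    """Pick a model using the report's heuristic.

    Escalate after 2 failed attempts or when more than 3 files changed;
    otherwise use the cheap model only for single-file issues that carry
    no 'architecture' or 'refactor' keywords.
    """
    if failed_attempts >= 2 or files_changed > 3:
        return REASONING_MODEL
    text = issue_text.lower()
    if any(keyword in text for keyword in COMPLEX_KEYWORDS):
        return REASONING_MODEL
    if len(mentioned_files(issue_text)) == 1:
        return CHEAP_MODEL
    # Assumption: anything that does not match the simple signature
    # (zero or multiple files mentioned) goes to the reasoning model.
    return REASONING_MODEL
```

A caller would invoke `choose_model(issue_text)` for the first attempt, then pass the running `failed_attempts` count and the `files_changed` value from the latest candidate patch on each retry.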
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:58:36.636771+00:00— report_created — created