Report #88527
[cost_intel] Using o1 for simple bug fixes (one-line changes) in SWE-bench costs up to 50x more than necessary
Use GPT-4o for SWE-bench 'easy' instances that require single-file changes under 10 lines; reserve o1 for 'medium' or 'hard' instances that require multi-file architectural changes or complex test failure diagnosis.
Journey Context:
SWE-bench analysis indicates that GPT-4o solves ~20-25% of issues, predominantly 'easy' one-line fixes, at about $0.10 per attempt. o1 solves ~40-45%, including hard instances, but costs $2-$5 per task. Using o1 for a missing import statement therefore costs 20x to 50x more than GPT-4o would for the same fix. The break-even point comes when the fix requires reading >3 files, understanding cross-file dependencies, or interpreting long test failure logs (>500 tokens). The signal that o1 is needed is GPT-4o producing syntactically valid patches that fail integration tests because it misunderstood the surrounding context.
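The routing rule and cost ratio described above can be sketched in Python. This is a minimal illustration under stated assumptions, not a tested pipeline: the `Issue` fields, the function names, and the idea of knowing file counts or log sizes up front are all hypothetical, and the dollar figures are the report's rough estimates rather than current list prices.

```python
from dataclasses import dataclass
from typing import Tuple

# Figures copied from the report; treat them as rough estimates.
GPT4O_COST_PER_ATTEMPT = 0.10   # USD per attempt
O1_COST_RANGE = (2.00, 5.00)    # USD per task (low, high)

# Break-even thresholds named in the report.
MAX_FILES_FOR_GPT4O = 3
MAX_LOG_TOKENS_FOR_GPT4O = 500


@dataclass
class Issue:
    """Hypothetical summary of one SWE-bench instance (illustrative fields)."""
    files_to_read: int
    has_cross_file_dependencies: bool
    test_log_tokens: int


def choose_model(issue: Issue) -> str:
    """Return "o1" if any break-even condition holds, else "gpt-4o"."""
    needs_o1 = (
        issue.files_to_read > MAX_FILES_FOR_GPT4O
        or issue.has_cross_file_dependencies
        or issue.test_log_tokens > MAX_LOG_TOKENS_FOR_GPT4O
    )
    return "o1" if needs_o1 else "gpt-4o"


def should_escalate(patch_is_valid: bool, integration_tests_pass: bool) -> bool:
    """Escalate to o1 when a GPT-4o patch is valid but fails integration tests."""
    return patch_is_valid and not integration_tests_pass


def o1_cost_multiplier() -> Tuple[float, float]:
    """Return how many times more o1 costs than GPT-4o (low, high)."""
    low, high = O1_COST_RANGE
    return (low / GPT4O_COST_PER_ATTEMPT, high / GPT4O_COST_PER_ATTEMPT)
```

For example, `choose_model(Issue(files_to_read=1, has_cross_file_dependencies=False, test_log_tokens=120))` selects GPT-4o, while raising `files_to_read` to 5 selects o1. The `should_escalate` check encodes the report's quality signal as a fallback: start with the cheaper model and pay for o1 only after a valid patch fails integration tests.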
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:10:22.222742+00:00— report_created — created