Report #90628
[cost\_intel] SWE-bench shows o1 costs 50x more than GPT-4o but gains only 7 points on simple bugs \(15 points overall\), creating negative ROI on easy tickets
Route to GPT-4o for bugs affecting <20 lines or single files; reserve o1 for architectural changes >200 lines or complex concurrency bugs
Journey Context:
On SWE-bench Verified, o1 solves 48% of tasks vs GPT-4o's 33% \(a 15-point gap\). On the 'easy' subset \(single file, <20 changed lines\), however, o1 achieves 55% vs GPT-4o's 48% \(a 7-point gain\) at 50x the cost \($50 vs $1 per task\). Break-even analysis: score each ticket with complexity heuristics \(lines changed \+ file count \+ cyclomatic complexity\) and route to o1 only when the complexity score exceeds 0.7.
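The break-even reasoning above can be sketched in Python. Using only the figures quoted in this report, the extra spend per additional solved task is the cost difference divided by the solve-rate gain: roughly $700 on the easy subset \($49 / 0.07\) versus roughly $327 overall \($49 / 0.15\). The report does not say how the complexity score is computed, so the `Ticket` type, the equal weighting, and the normalization caps \(200 lines, 5 files, cyclomatic complexity 20\) below are hypothetical placeholders, not the report's formula; only the 0.7 threshold and the cost and solve-rate figures come from the text.

```python
from dataclasses import dataclass

# Figures quoted in the report.
O1_COST_USD = 50.0      # per task
GPT4O_COST_USD = 1.0    # per task
O1_THRESHOLD = 0.7      # complexity score above which o1 is considered justified


def marginal_cost_per_extra_solve(rate_strong: float, rate_cheap: float,
                                  cost_strong: float = O1_COST_USD,
                                  cost_cheap: float = GPT4O_COST_USD) -> float:
    """Extra dollars spent per additional task solved by the stronger model."""
    gain = rate_strong - rate_cheap
    if gain <= 0:
        return float("inf")  # the stronger model never pays for itself
    return (cost_strong - cost_cheap) / gain


@dataclass
class Ticket:
    """Hypothetical ticket features; names are illustrative."""
    lines_changed: int
    file_count: int
    cyclomatic_complexity: int


def complexity_score(ticket: Ticket, max_lines: int = 200,
                     max_files: int = 5, max_cc: int = 20) -> float:
    """Return a score in [0, 1].

    ASSUMPTION: equal weights and the caps are placeholders; the report
    names the three inputs but not how they are combined.
    """
    parts = (
        min(ticket.lines_changed / max_lines, 1.0),
        min(ticket.file_count / max_files, 1.0),
        min(ticket.cyclomatic_complexity / max_cc, 1.0),
    )
    return sum(parts) / len(parts)


def route(ticket: Ticket) -> str:
    """Pick a model name using the report's 0.7 threshold."""
    return "o1" if complexity_score(ticket) > O1_THRESHOLD else "gpt-4o"


if __name__ == "__main__":
    print(round(marginal_cost_per_extra_solve(0.55, 0.48)))  # easy subset
    print(round(marginal_cost_per_extra_solve(0.48, 0.33)))  # overall
    print(route(Ticket(lines_changed=10, file_count=1, cyclomatic_complexity=3)))
    print(route(Ticket(lines_changed=300, file_count=6, cyclomatic_complexity=25)))
```

Under these assumed weights, a 10-line single-file fix scores well below 0.7 and stays on GPT-4o, while a 300-line, six-file change saturates every component and goes to o1. The weights and caps would need calibrating against your own ticket history before use.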
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:42:52.269135+00:00— report_created — created