Report #63833
[cost_intel] What is the cost-per-correct-answer comparison between o3-mini and GPT-4o on SWE-bench?
On SWE-bench Verified, o3-mini achieves a 40-50% solve rate at $3-4 per task, while GPT-4o achieves 15-20% at $0.50 per task. Cost per correct answer is roughly equal ($7-10) until task complexity exceeds 'one file change, <50 lines.' Beyond that threshold, only reasoning models succeed at all, so their higher cost per attempt is justified: they have a non-zero success rate where cheaper models score zero.
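The comparison above rests on one ratio: cost per attempt divided by solve rate. A minimal sketch of that arithmetic follows, assuming independent attempts at a fixed solve rate; the function name `cost_per_correct` is hypothetical and the inputs are the ranges quoted in this report, not new measurements. With those aggregate figures the naive ratio gives about $6-10 for o3-mini but only about $2.50-3.33 for GPT-4o, so the "roughly equal ($7-10)" range presumably reflects something beyond this simple division (for example retries or a harder task mix). That reading is an assumption, not something the report states.

```python
def cost_per_correct(cost_per_attempt: float, solve_rate: float) -> float:
    """Expected spend per solved task, assuming independent attempts
    at a fixed solve rate (a simplification, not the report's method)."""
    if not 0 < solve_rate <= 1:
        raise ValueError("solve_rate must be in (0, 1]")
    return cost_per_attempt / solve_rate


# Ranges quoted in the report: (cost per task in USD, solve rate).
quoted = {
    "o3-mini best case": (3.00, 0.50),
    "o3-mini worst case": (4.00, 0.40),
    "GPT-4o best case": (0.50, 0.20),
    "GPT-4o worst case": (0.50, 0.15),
}

for label, (cost, rate) in quoted.items():
    print(f"{label}: ${cost_per_correct(cost, rate):.2f} per correct answer")
```

A solve rate of zero is rejected rather than returned as infinity, which mirrors the report's point: where a model never succeeds, no per-attempt price makes it cost-effective.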
Journey Context:
While o3-mini is 10x more expensive per token, it needs fewer attempts to solve complex software engineering tasks that require multi-file reasoning. For simple bugs (syntax errors, typo fixes), however, GPT-4o succeeds on the first attempt 80% of the time at 1/6th the cost. The crossover point is task depth: when the fix requires understanding >3 files or >2 logical steps of indirection, GPT-4o's success rate drops below 30% while o3-mini maintains 60%+. The cost-per-correct-answer curve shows diminishing returns for reasoning models on simple tasks and a sharp divergence in their favor on complex ones. GPT-4o's quality-degradation signature on SWE tasks is 'file blindness': it fails to recognize that changes in file A require corresponding updates in file B, producing partial fixes that crash on integration.
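The task-depth crossover described above can be sketched as a simple routing rule. This is an illustration under stated assumptions, not a tested policy: the `Task` fields and `pick_model` function are hypothetical, the returned strings are plain labels rather than API model identifiers, and the thresholds (>3 files, >2 steps of indirection) are taken directly from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class Task:
    files_to_understand: int  # files the fix requires understanding
    indirection_steps: int    # logical steps of indirection in the fix


def pick_model(task: Task) -> str:
    """Route deep tasks to the reasoning model, shallow ones to the cheaper model.

    Thresholds follow the report's crossover: more than 3 files or
    more than 2 steps of indirection.
    """
    if task.files_to_understand > 3 or task.indirection_steps > 2:
        return "o3-mini"
    return "gpt-4o"


# A typo fix in one file stays on the cheaper model;
# a five-file change goes to the reasoning model.
print(pick_model(Task(files_to_understand=1, indirection_steps=1)))
print(pick_model(Task(files_to_understand=5, indirection_steps=1)))
```

In practice the two inputs would have to be estimated before the fix is attempted, which this sketch does not address.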
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:37:47.858657+00:00— report_created — created