Report #96352
[cost_intel] When does o3-mini/o1 beat GPT-4o on repository-level code edits by >40% accuracy despite 10x cost?
Use reasoning models (o3-mini-high, o1) only when the task requires changes to more than 3 files, cross-file dependency analysis, or test-driven debugging; otherwise GPT-4o with retrieval is 80% cheaper with a <10% accuracy drop on single-file edits.
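The routing rule above can be sketched as a small function. This is a minimal illustration, not a tested policy: the `EditTask` fields, the `pick_model` name, and the returned model labels are hypothetical, and the 3-file threshold is taken directly from the summary.

```python
from dataclasses import dataclass


@dataclass
class EditTask:
    """Hypothetical description of a repository-level edit request."""
    files_touched: int
    needs_cross_file_analysis: bool = False
    needs_test_driven_debugging: bool = False


def pick_model(task: EditTask) -> str:
    """Return a model label using the report's rule of thumb.

    Escalate to a reasoning model only if the task touches more than
    3 files, needs cross-file dependency analysis, or needs test-driven
    debugging. Otherwise default to the cheaper model.
    """
    if (
        task.files_touched > 3
        or task.needs_cross_file_analysis
        or task.needs_test_driven_debugging
    ):
        return "o3-mini-high"  # label only; assumed, not an API identifier
    return "gpt-4o"  # assumed to be paired with retrieval, per the report


if __name__ == "__main__":
    print(pick_model(EditTask(files_touched=1)))
    print(pick_model(EditTask(files_touched=5)))
    print(pick_model(EditTask(files_touched=2, needs_cross_file_analysis=True)))
```

In practice the three signals would have to be estimated before the edit runs (for example from the issue text or a retrieval pass), which this sketch does not attempt.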
Journey Context:
On multi-file bugs, SWE-bench Verified results show o3-mini-high scoring ~60-70% while GPT-4o scores ~15-20%. The accuracy gap collapses to <5% on single-file fixes, where GPT-4o suffices. GPT-4o's error mode is subtle: it generates plausible-looking but semantically wrong imports, or misses side effects in dependent modules. The signature is "correct syntax, failing tests." The cost delta is 15-30x ($0.40 vs $12 per task at high throughput). Latency is 5-30s for GPT-4o vs 30-90s for o3-mini; the latter is acceptable for async CI fixes but too slow for interactive coding.
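The report quotes cost both as a multiplier and as a percentage saving, so it can help to convert between the two. The sketch below uses only the per-task dollar figures quoted above; the function names are hypothetical.

```python
def cost_multiplier(cheap_cost: float, expensive_cost: float) -> float:
    """How many times more the expensive option costs per task."""
    return expensive_cost / cheap_cost


def savings_fraction(multiplier: float) -> float:
    """Fraction saved by choosing the cheap option at a given multiplier."""
    return 1.0 - 1.0 / multiplier


if __name__ == "__main__":
    # Per-task figures as quoted in the report: $0.40 vs $12.
    m = cost_multiplier(0.40, 12.00)
    print(f"quoted figures: {m:.0f}x, savings {savings_fraction(m):.1%}")

    # Multipliers mentioned elsewhere in the report, for comparison.
    for k in (5, 10, 15, 30):
        print(f"{k}x -> {savings_fraction(k):.1%} cheaper")
```

By this arithmetic, an 80% saving corresponds to a 5x multiplier and the quoted dollar figures sit at the 30x end of the range, so the report's cost numbers are best read as rough bounds rather than a single consistent estimate.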
Lifecycle
2026-06-22T20:18:40.183798+00:00— report_created — created