Report #62238
[cost\_intel] Using GPT-4o for autonomous code repair on real GitHub issues \(SWE-bench\)
Use o1-preview for SWE-bench tasks: expect roughly 2.5x the success rate \(41% vs 16%\). Accept its 15-30s latency only in asynchronous CI/CD pipelines.
Journey Context:
GPT-4o tends to generate syntactically correct but semantically wrong patches because it reasons only shallowly about execution traces. o1-preview traces execution before generating code, so it handles edge cases in error-handling paths that GPT-4o misses. Cost is about $3 per task for o1-preview vs $0.30 for GPT-4o, but each human intervention on a failed patch costs $50\+, so the higher success rate can outweigh the higher per-task price. Critical limitation: o1 struggles with UI-heavy issues that require visual DOM reasoning; use a GPT-4-Vision \+ o1 hybrid for those. Never use o1 for synchronous IDE autocomplete, because the latency is too high.
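The cost trade-off can be checked with a rough expected-cost calculation. This is only a sketch: it uses the per-task prices and success rates quoted in this report, and it assumes that every failed patch triggers exactly one human intervention at $50, the lower bound of the figure above. The function name `expected_cost` and the `MODELS` table are illustrative, not part of any tool.

```python
# Figures come from this report; the failure model is an assumption:
# every failed patch is assumed to need one human intervention.
HUMAN_COST = 50.00  # lower bound of the "$50+" intervention cost

MODELS = {
    "gpt-4o": {"cost_per_task": 0.30, "success_rate": 0.16},
    "o1-preview": {"cost_per_task": 3.00, "success_rate": 0.41},
}


def expected_cost(cost_per_task, success_rate, human_cost=HUMAN_COST):
    """Model cost plus the expected cost of a human fixing a failed patch."""
    return cost_per_task + (1.0 - success_rate) * human_cost


for name, figures in MODELS.items():
    print(f"{name}: ${expected_cost(**figures):.2f} expected per issue")
```

Under these assumptions the expected cost per issue is about $42.30 for GPT-4o and about $32.50 for o1-preview, so the more expensive model comes out cheaper overall. The gap widens if the real intervention cost is above $50 and narrows if some failed patches are simply discarded without human follow-up.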
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:57:15.911526+00:00— report_created — created