Report #50746
[cost_intel] Real-world software engineering (SWE-bench Verified) resolution rates
Use o1 for bug fixing and complex refactoring; reserve GPT-4o for boilerplate generation. o1 resolves 41% of SWE-bench Verified issues versus 11% for GPT-4o, which justifies its 8x per-call cost for production bugs.
Journey Context:
SWE-bench tasks require understanding a large codebase, running tests, and making multi-file edits. Instruct models lack the coherence to reason over contexts larger than about 10k tokens. A common mistake is to use a cheaper model for "quick fixes" that actually require architectural understanding, which leads to 4x more failed CI runs. Despite the higher per-call price, total project cost comes out lower because fewer retries are needed.
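The cost argument above can be checked with a rough back-of-the-envelope model. The sketch below is illustrative only and rests on assumptions that are not in the report: each attempt is treated as an independent trial that succeeds with the quoted resolution rate (real retries on the same issue are correlated), costs are in units of one GPT-4o call, and `overhead_per_attempt` is a hypothetical figure for the CI run and review time each attempt consumes. Both function names are made up for this example.

```python
def expected_cost_per_resolved_issue(call_cost, resolve_rate, overhead_per_attempt):
    """Expected spend until the first successful fix.

    Assumes independent attempts with a constant success probability,
    so the expected number of attempts is 1 / resolve_rate.
    """
    if not 0 < resolve_rate <= 1:
        raise ValueError("resolve_rate must be in (0, 1]")
    expected_attempts = 1 / resolve_rate
    return (call_cost + overhead_per_attempt) * expected_attempts


def break_even_overhead(cheap_cost, cheap_rate, pricey_cost, pricey_rate):
    """Per-attempt overhead at which both models cost the same per resolved issue.

    Derived by setting (cheap_cost + h) / cheap_rate equal to
    (pricey_cost + h) / pricey_rate and solving for h.
    """
    return (cheap_rate * pricey_cost - pricey_rate * cheap_cost) / (pricey_rate - cheap_rate)


if __name__ == "__main__":
    # Figures from the report: 11% vs 41% resolution, 8x per-call cost.
    # Costs are relative: one GPT-4o call = 1.0 unit (assumption).
    gpt4o = {"call_cost": 1.0, "resolve_rate": 0.11}
    o1 = {"call_cost": 8.0, "resolve_rate": 0.41}

    # Hypothetical overhead values (CI minutes, review time) in the same units.
    for overhead in (0.0, 1.0, 2.0, 5.0, 20.0):
        cost_4o = expected_cost_per_resolved_issue(overhead_per_attempt=overhead, **gpt4o)
        cost_o1 = expected_cost_per_resolved_issue(overhead_per_attempt=overhead, **o1)
        print(f"overhead={overhead:5.1f}  GPT-4o={cost_4o:7.2f}  o1={cost_o1:7.2f}")

    h = break_even_overhead(
        gpt4o["call_cost"], gpt4o["resolve_rate"], o1["call_cost"], o1["resolve_rate"]
    )
    print(f"break-even overhead per attempt: {h:.2f}")
```

Under this simplified model, the cheaper model wins on API spend alone (zero overhead), and o1 becomes cheaper per resolved issue once each attempt carries more than about 1.57 units of overhead, that is, roughly one and a half times the price of a GPT-4o call. Whether a given project clears that threshold depends on its actual CI and review costs.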
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:39:40.606815+00:00— report_created — created