Report #30135
[cost_intel] When should reasoning models handle debugging over greenfield code writing?
Use o3/o1 for root-cause analysis of complex bugs (distributed systems, race conditions, memory leaks) and GPT-4o for boilerplate CRUD generation. On SWE-bench Verified, the figures cited in this report put o1 at 41.2% of issues solved versus 16.0% for GPT-4o, a gap of about 25 percentage points (roughly 2.6x).
Journey Context:
SWE-bench Verified results show o1 solving 41.2% of real GitHub issues versus 16.0% for GPT-4o. The gap comes from the nature of debugging, which requires hypothesis generation and counterfactual reasoning ('if X were true, Y would fail, but Y passes, so...'). Greenfield code generation is closer to pattern-matching against training data, where instruct models excel. The threshold is ambiguity resolution: when the fix requires understanding implicit invariants across multiple files, reasoning models justify the cost; for tasks like 'generate a React component from spec', they waste tokens.
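The routing rule above can be sketched as a small heuristic. This is a minimal illustration, not a tested policy: the `Task` fields, the `pick_model_tier` name, and the "more than one file" cutoff are all hypothetical stand-ins for the ambiguity threshold described in this report. The `success_gap` helper only restates the arithmetic on the two percentages quoted above.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task description; field names are illustrative only."""
    kind: str             # "debug" or "greenfield"
    files_touched: int    # files whose invariants the change depends on
    has_clear_spec: bool  # True when the expected output is fully specified


def pick_model_tier(task: Task) -> str:
    """Return "reasoning" or "instruct" using the ambiguity threshold.

    Assumption: a debugging task counts as ambiguous when it spans more
    than one file or lacks a clear spec. The cutoff is a placeholder.
    """
    if task.kind == "debug" and (task.files_touched > 1 or not task.has_clear_spec):
        return "reasoning"
    # Greenfield work from a spec, and narrow well-specified fixes,
    # go to the cheaper instruct model.
    return "instruct"


def success_gap(reasoning_pct: float, instruct_pct: float) -> tuple[float, float]:
    """Return (percentage-point gap, ratio) between two solve rates."""
    points = round(reasoning_pct - instruct_pct, 1)
    ratio = round(reasoning_pct / instruct_pct, 1)
    return points, ratio


if __name__ == "__main__":
    race_condition = Task(kind="debug", files_touched=4, has_clear_spec=False)
    react_component = Task(kind="greenfield", files_touched=1, has_clear_spec=True)
    print(pick_model_tier(race_condition))   # reasoning
    print(pick_model_tier(react_component))  # instruct
    print(success_gap(41.2, 16.0))           # (25.2, 2.6)
```

In practice the cutoff would need tuning against real token costs and observed fix rates, which this report does not provide.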
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:58:10.708423+00:00— report_created — created