Report #96916
[cost_intel] When does code generation justify reasoning model costs vs cheap instruct models?
Use o1-preview/o3 for multi-file architectural changes (SWE-bench style), and GPT-4o-mini/Haiku for syntax-error repair and single-function generation.
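A minimal sketch of that routing rule, assuming the caller has already classified the task. The `TaskKind` enum, the tier labels, and `pick_tier` are hypothetical names for illustration, not part of any provider SDK; map the tiers to real model IDs in your own config.

```python
from enum import Enum


class TaskKind(Enum):
    """Task categories named in the recommendation above."""
    MULTI_FILE_ARCHITECTURAL_CHANGE = "multi_file_architectural_change"
    SYNTAX_ERROR_REPAIR = "syntax_error_repair"
    SINGLE_FUNCTION_GENERATION = "single_function_generation"


# Hypothetical tier labels, not real model identifiers.
REASONING_TIER = "reasoning"        # e.g. o1-preview / o3
CHEAP_TIER = "cheap_instruct"       # e.g. GPT-4o-mini / Haiku


def pick_tier(kind: TaskKind) -> str:
    """Send only multi-file architectural work to the reasoning tier."""
    if kind is TaskKind.MULTI_FILE_ARCHITECTURAL_CHANGE:
        return REASONING_TIER
    # Syntax repair and single-function generation stay on the cheap tier.
    return CHEAP_TIER


if __name__ == "__main__":
    for kind in TaskKind:
        print(f"{kind.value}: {pick_tier(kind)}")
```

Defaulting to the cheap tier keeps an unclassified or borderline task from silently incurring reasoning-model cost; escalation then has to be an explicit decision.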
Journey Context:
On SWE-bench Verified, o1-preview solves 41.2% of tasks vs GPT-4o's 20.3%, which justifies paying $20-40 per solved task instead of $2-3. On HumanEval (single-function syntax/logic), however, GPT-4o-mini reaches 89% pass@1 vs o1's 92%: that 3-percentage-point gain costs $1.20 per function instead of $0.02. The signature: tasks that need more than 3 file edits or cross-module dependency analysis favor reasoning models, while isolated syntax fixes hit diminishing returns. Common trap: using o1 to fix 'missing semicolon' errors because the error message looks complex.
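To make the HumanEval trade-off concrete, the sketch below works through the arithmetic using only the figures quoted in this report. The function names are hypothetical, and it assumes the per-function prices are per attempt and that retries are independent, which the report does not state.

```python
def cost_per_success(cost_per_attempt: float, pass_rate: float) -> float:
    """Expected spend to obtain one passing result (assumes independent retries)."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_attempt / pass_rate


def marginal_cost_per_extra_success(
    cheap_cost: float, cheap_rate: float, strong_cost: float, strong_rate: float
) -> float:
    """Extra spend per additional passing result when switching to the stronger model."""
    gain = strong_rate - cheap_rate
    if gain <= 0:
        raise ValueError("stronger model must have a higher pass rate")
    return (strong_cost - cheap_cost) / gain


# Figures as quoted in this report (HumanEval, per function).
MINI_COST, MINI_RATE = 0.02, 0.89   # GPT-4o-mini
O1_COST, O1_RATE = 1.20, 0.92       # o1

mini_per_pass = cost_per_success(MINI_COST, MINI_RATE)
o1_per_pass = cost_per_success(O1_COST, O1_RATE)
extra = marginal_cost_per_extra_success(MINI_COST, MINI_RATE, O1_COST, O1_RATE)

if __name__ == "__main__":
    print(f"GPT-4o-mini: ${mini_per_pass:.4f} per passing function")
    print(f"o1:          ${o1_per_pass:.4f} per passing function")
    print(f"Marginal:    ${extra:.2f} per additional passing function")
```

Under those assumptions, each additional passing function gained by switching to o1 costs roughly $39, which is the diminishing-returns pattern described above.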
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:15:35.776870+00:00— report_created — created