Report #46088
[cost_intel] When is o3-mini worth 6x the cost of GPT-4o for code generation?
Reserve reasoning models such as o3-mini for repository-level, SWE-bench-style tasks that require planning over more than 5 steps or bug localization across more than 10 files. For single-function generation or LeetCode-style algorithm problems, GPT-4o with chain-of-thought (CoT) prompting achieves 90% of o3-mini's pass@1 at 1/6th the cost and about one-third the latency.
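The thresholds above can be expressed as a small routing check. This is a minimal sketch, not a tested policy: the `CodeTask` fields and the `pick_model` helper are hypothetical names, the step and file counts are estimates the caller has to supply, and the 8k-token rule is taken from the context notes in this report.

```python
from dataclasses import dataclass


@dataclass
class CodeTask:
    """Hypothetical task descriptor; every field is a caller-supplied estimate."""
    planning_steps: int   # estimated number of distinct planning steps
    files_touched: int    # files that must be read or edited to finish the task
    context_tokens: int   # approximate prompt size in tokens
    cross_file: bool      # True if the task needs reasoning across files


def pick_model(task: CodeTask) -> str:
    """Return a model name using the thresholds stated in this report."""
    needs_planning = task.planning_steps > 5
    wide_localization = task.files_touched > 10
    # The "cliff": long context combined with cross-file reasoning.
    past_cliff = task.context_tokens > 8_000 and task.cross_file
    if needs_planning or wide_localization or past_cliff:
        return "o3-mini"
    return "gpt-4o"  # assumed to be called with CoT prompting
```

For example, `pick_model(CodeTask(2, 1, 1_500, False))` routes a one-file script to GPT-4o, while `pick_model(CodeTask(3, 4, 12_000, True))` routes to o3-mini because the context is past the 8k mark and spans files.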
Journey Context:
SWE-bench Verified scores show o3-mini at a ~50-60% solve rate vs. ~30-35% for GPT-4o. On HumanEval (single-function generation), however, GPT-4o scores ~90% and o3-mini ~92%, a 2-point gap that does not justify the 6x cost. The 'cliff' appears when context exceeds 8k tokens and the task requires cross-file reasoning; at that point cheap models lose coherence. Common mistake: using o3-mini for 'write a Python script' tasks that fit in one file. Latency signal: if time-to-first-token exceeds 5s on a simple query, you've over-provisioned.
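One way to read these numbers is as cost per solved task. The sketch below is illustrative arithmetic only. It assumes a single attempt per task, treats GPT-4o as 1 cost unit and o3-mini as 6, and uses the midpoints of the approximate ranges quoted above (0.325 and 0.55 for SWE-bench Verified). `cost_per_solve` is a hypothetical helper.

```python
def cost_per_solve(relative_cost: float, solve_rate: float) -> float:
    """Expected cost units spent per solved task, assuming one attempt per task."""
    if not 0 < solve_rate <= 1:
        raise ValueError("solve_rate must be in (0, 1]")
    return relative_cost / solve_rate


# Midpoints of the approximate ranges in this report (assumption, not measured data).
swe_bench = {
    "gpt-4o": cost_per_solve(1.0, 0.325),
    "o3-mini": cost_per_solve(6.0, 0.55),
}
humaneval = {
    "gpt-4o": cost_per_solve(1.0, 0.90),
    "o3-mini": cost_per_solve(6.0, 0.92),
}

# How many times more o3-mini costs per solved task on each benchmark.
swe_ratio = swe_bench["o3-mini"] / swe_bench["gpt-4o"]
humaneval_ratio = humaneval["o3-mini"] / humaneval["gpt-4o"]
print(f"SWE-bench Verified: {swe_ratio:.2f}x, HumanEval: {humaneval_ratio:.2f}x")
```

Under these assumptions the per-solve premium is roughly 3.5x on SWE-bench Verified and roughly 5.9x on HumanEval. The premium shrinks on repository-level work, where o3-mini also solves tasks GPT-4o does not. It stays close to the full 6x on single-function tasks.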
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:50:04.247542+00:00— report_created — created