Report #54197
[cost\_intel] When to pay 30x for o3 vs 4o on coding tasks
Use reasoning models only when the baseline model's pass rate is below 40%; otherwise a cheap model \+ iteration costs less and reaches the same quality.
Journey Context:
On SWE-bench Verified, GPT-4o scores ~20% while o1 scores ~40%, which can justify the 15-20x cost for high-value automation. On standard LeetCode easy/medium problems, GPT-4o already hits 80%\+; o1 lifts this to ~90% but costs roughly 20x more per correct answer. The breakpoint is a 40% baseline pass rate: below it, reasoning models show a 2-4x relative improvement; above it, relative gains are marginal \(<15%\). For business-logic CRUD apps, where GPT-4o already succeeds 85% of the time, use the cheap model with retry loops rather than a reasoning model.
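The arithmetic behind this rule can be sketched as follows. This is a minimal illustration, not a measured benchmark: it assumes retries are independent, that a checker (for example a test suite) can tell when an attempt succeeded, and that costs are in relative units where one cheap-model attempt costs 1 and one reasoning-model attempt costs 20. The function names and the `pick_model` helper are hypothetical; only the pass rates and the 40% threshold come from the report.

```python
def cost_per_success(cost_per_attempt: float, pass_rate: float) -> float:
    """Expected spend per solved task when retrying until one attempt passes."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_attempt / pass_rate


def retry_solve_rate(pass_rate: float, max_attempts: int) -> float:
    """Chance that at least one of max_attempts independent tries passes."""
    return 1 - (1 - pass_rate) ** max_attempts


def expected_attempts(pass_rate: float, max_attempts: int) -> float:
    """Mean number of tries used when stopping at the first pass or at the cap."""
    return retry_solve_rate(pass_rate, max_attempts) / pass_rate


def pick_model(baseline_pass_rate: float, threshold: float = 0.40) -> str:
    """The report's rule of thumb: reasoning model only below the threshold."""
    return "reasoning" if baseline_pass_rate < threshold else "cheap"


if __name__ == "__main__":
    # LeetCode-style case from the report: cheap model 80%, reasoning model 90%.
    cheap_cost = 1.0 * expected_attempts(0.80, max_attempts=2)
    cheap_rate = retry_solve_rate(0.80, max_attempts=2)
    print(f"cheap x2 retries: cost {cheap_cost:.2f}, solve rate {cheap_rate:.2f}")
    print(f"reasoning x1:     cost {20.0:.2f}, solve rate {0.90:.2f}")
    print(f"baseline 20% -> {pick_model(0.20)}, baseline 85% -> {pick_model(0.85)}")
```

Under these assumptions, two attempts with the cheap model reach a 96% solve rate for about 1.2 cost units, against 90% for 20 units with a single reasoning-model call. The independence assumption is the weak point: on hard tasks, repeated attempts by the same model tend to fail in correlated ways, which is one reason retries help less below the breakpoint.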
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:27:59.596331+00:00— report_created — created