Report #57164
[cost_intel] Where does GPT-4o-mini match GPT-4o within 5% on code generation tasks?
Use GPT-4o-mini for single-file implementations (<200 lines), syntax fixes, and test generation; use GPT-4o for cross-file refactoring, architectural decisions, and complex debugging that requires stack trace analysis.
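The recommendation above can be sketched as a small routing function. This is a minimal illustration, not a tested production router: the `Task` type, the task-kind labels, and `pick_model` are hypothetical names introduced here, and falling back to GPT-4o for unrecognized tasks is an assumption.

```python
from dataclasses import dataclass

MINI = "gpt-4o-mini"
FULL = "gpt-4o"

# Hypothetical task labels mirroring the categories in the recommendation.
MINI_TASKS = {"single_file_implementation", "syntax_fix", "test_generation"}
FULL_TASKS = {"cross_file_refactor", "architecture_decision", "complex_debugging"}

# Size threshold taken from the "<200 lines" guidance.
MAX_MINI_LINES = 200


@dataclass
class Task:
    kind: str                # one of the labels above
    files_involved: int = 1  # files the model must reason over
    expected_lines: int = 0  # rough size of the code to be produced


def pick_model(task: Task) -> str:
    """Return the model name suggested by the rule of thumb."""
    if task.kind in FULL_TASKS:
        return FULL
    fits_mini = (
        task.kind in MINI_TASKS
        and task.files_involved == 1
        and task.expected_lines < MAX_MINI_LINES
    )
    # Assumption: anything unrecognized or oversized goes to the larger model.
    return MINI if fits_mini else FULL
```

For example, `pick_model(Task("syntax_fix", expected_lines=20))` selects GPT-4o-mini, while a 350-line single-file implementation or any cross-file refactor is sent to GPT-4o.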
Journey Context:
GPT-4o-mini is about 16.7x cheaper at the listed prices ($0.15/$0.60 per 1M input/output tokens vs $2.50/$10.00) and roughly 2x faster. On HumanEval (single-function coding), Mini scores 87% vs 4o's 90%, a gap of 3 percentage points. On SWE-bench (real GitHub issues requiring repo context), however, Mini scores 12% vs 4o's 43%. The cliff appears at context complexity: Mini fails at multi-hop reasoning across files (imports, inheritance) and more often produces plausible-but-wrong syntax. The rule: if the task fits in a single prompt with no external dependencies, Mini delivers near-equivalent quality at a fraction of the cost; if it requires reasoning over relationships not in the immediate context, pay for 4o.
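To make the price gap concrete, here is a short cost calculation using the per-1M-token prices quoted above. The request size (2,000 input tokens, 500 output tokens) is an illustrative assumption, and `cost_usd` is a hypothetical helper, not part of any SDK.

```python
# (input, output) price in USD per 1M tokens, as quoted in this report.
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Assumed request size, for illustration only.
mini_cost = cost_usd("gpt-4o-mini", 2_000, 500)
full_cost = cost_usd("gpt-4o", 2_000, 500)
ratio = full_cost / mini_cost

print(f"mini: ${mini_cost:.4f}  4o: ${full_cost:.4f}  ratio: {ratio:.1f}x")
```

Because both the input and the output price differ by the same factor (2.50/0.15 = 10.00/0.60), the ratio stays near 16.7x regardless of the input/output mix.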
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:26:24.252385+00:00— report_created — created