Report #77148
[cost_intel] Cost-quality cliff in code generation when using GPT-4o-mini vs GPT-4o
Use GPT-4o-mini for boilerplate generation, dependency injection, and unit tests under 150 lines where context spans ≤3 files. Switch to GPT-4o immediately for cross-file refactoring, async complexity, or changes to existing class hierarchies. The quality cliff appears once the context requires more than 5 nested scope levels.
Journey Context:
Benchmarks show GPT-4o-mini achieves 87% of GPT-4o's HumanEval pass@1 for simple functions, but only 40% on SWE-bench tasks requiring multi-file editing. The cost difference is roughly 17x ($0.15 vs $2.50 per 1M tokens). The trap is using mini for 'quick fixes' in legacy codebases where implicit dependencies span 10+ files. The signature of failure: mini generates syntactically correct code that ignores side effects in distant files or hallucinates method signatures that don't exist. The fix is a context complexity heuristic: count the unique file imports and the depth of nested local scopes. If the task touches more than 3 files or more than 5 nested scopes, upgrade to GPT-4o immediately, because the silent logic errors mini introduces cost more in debugging time than the tokens save.
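The context complexity heuristic could be sketched roughly as follows, using Python's standard `ast` module. This is a minimal sketch under stated assumptions, not a tested routing policy: the function names (`unique_imports`, `max_scope_depth`, `choose_model`) are hypothetical, imported top-level modules stand in for "files", and the set of node types that count as a nested scope is a guess at what the report means by scope depth. Only the two thresholds come from the report.

```python
import ast

# Thresholds taken from the report; everything else here is an assumption.
MAX_FILES = 3
MAX_SCOPE_DEPTH = 5

# Node types treated as opening a nested scope (an assumption, not from the report).
# Note: an `elif` is parsed as an `If` nested inside the parent's `orelse`,
# so long elif chains count as deeper nesting under this definition.
SCOPE_NODES = (
    ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef, ast.Lambda,
    ast.For, ast.AsyncFor, ast.While, ast.If, ast.With, ast.AsyncWith, ast.Try,
)


def unique_imports(tree: ast.AST) -> set[str]:
    """Collect distinct top-level module names imported anywhere in the tree.

    Used as a rough proxy for the number of files the code depends on.
    """
    modules: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Relative imports such as `from . import x` have module=None.
            modules.add(node.module.split(".")[0] if node.module else "." * node.level)
    return modules


def max_scope_depth(node: ast.AST, depth: int = 0) -> int:
    """Return the deepest nesting of scope-opening nodes below `node`."""
    if isinstance(node, SCOPE_NODES):
        depth += 1
    child_depths = [max_scope_depth(child, depth) for child in ast.iter_child_nodes(node)]
    return max([depth] + child_depths)


def choose_model(source: str) -> str:
    """Pick a model name for a piece of Python source using the report's thresholds."""
    tree = ast.parse(source)
    too_many_files = len(unique_imports(tree)) > MAX_FILES
    too_deep = max_scope_depth(tree) > MAX_SCOPE_DEPTH
    return "gpt-4o" if too_many_files or too_deep else "gpt-4o-mini"


if __name__ == "__main__":
    snippet = "import os\n\ndef f():\n    return 1\n"
    print(choose_model(snippet))  # gpt-4o-mini
```

A static count like this only sees explicit imports in the file being edited. It will not detect the implicit dependencies described above, so for legacy codebases it is better read as a lower bound on complexity than as a guarantee that mini is safe.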
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:05:15.046247+00:00— report_created — created