Report #96230
[cost_intel] Code generation tasks: where frontier models are genuinely irreplaceable vs where smaller models suffice
Use frontier models (Sonnet, GPT-4-class) for multi-file code changes, debugging complex systems, implementing features requiring business logic understanding, and any code requiring cross-dependency reasoning. Use smaller models (Haiku, GPT-4o-mini) for boilerplate generation, single-function implementation with clear specs, docstring writing, and test generation for straightforward cases.
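A minimal sketch of the split above as a lookup, for illustration only. The task labels, the `Tier` enum, and `tier_for_task` are hypothetical names, not part of any library. Sending unrecognized task types to the frontier tier is an assumption made here, not something the report specifies.

```python
from enum import Enum


class Tier(Enum):
    FRONTIER = "frontier"  # e.g. Sonnet, GPT-4-class
    SMALL = "small"        # e.g. Haiku, GPT-4o-mini


# Hypothetical task labels; the grouping mirrors the recommendation above.
FRONTIER_TASKS = {
    "multi_file_change",
    "complex_debugging",
    "business_logic_feature",
    "cross_dependency_reasoning",
}
SMALL_TASKS = {
    "boilerplate",
    "single_function_clear_spec",
    "docstring",
    "simple_test_generation",
}


def tier_for_task(task_type: str) -> Tier:
    """Return the model tier for a task label."""
    if task_type in SMALL_TASKS:
        return Tier.SMALL
    # Assumption: anything not explicitly listed as cheap goes to the
    # frontier tier, trading cost for a lower risk of subtle logic errors.
    return Tier.FRONTIER
```

Defaulting unknown labels to the frontier tier is the conservative choice; a team optimizing harder for cost could invert that default.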
Journey Context:
The quality gap between frontier and smaller models is at its widest in code generation. Smaller models produce code that compiles but contains subtle logic errors: off-by-one errors in business logic, incorrect state management, missing edge cases, and failure to maintain invariants across function boundaries. The signature of a smaller-model code failure is correct syntax, correct API usage, wrong semantics. On SWE-bench-style evaluations, frontier models produce merge-ready code on roughly 2-3x as many tasks as smaller models. For single-function implementations with clear I/O specifications, however, smaller models come within a few percentage points of frontier models. Frontier models are roughly 20x more expensive per token, so the mistake is using them for every code task: at that price ratio, a boilerplate task sent to Sonnet costs about 20x what it needs to, meaning roughly 95% of the spend on that task is wasted. Route based on task complexity. If the change touches more than one file or requires understanding existing codebase patterns, use a frontier model. If it is a self-contained function with clear I/O, use the cheaper model.
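The routing rule and the cost arithmetic above can be sketched as follows. `CodeTask`, `needs_frontier`, and `routed_cost_fraction` are hypothetical names. The 20x price ratio is the rough figure from this report, and the calculation assumes every task uses about the same number of tokens, which real workloads will not match exactly.

```python
from dataclasses import dataclass


@dataclass
class CodeTask:
    files_touched: int             # number of files the change modifies
    needs_codebase_patterns: bool  # must follow existing project conventions?


def needs_frontier(task: CodeTask) -> bool:
    """Rule from the text: more than one file, or reliance on existing
    codebase patterns, means the task goes to a frontier model."""
    return task.files_touched > 1 or task.needs_codebase_patterns


def routed_cost_fraction(simple_share: float, price_ratio: float = 20.0) -> float:
    """Spend after routing, as a fraction of sending everything to frontier.

    simple_share: fraction of tasks (by tokens) that can go to the small model.
    price_ratio:  frontier price / small-model price (assumed ~20x here).
    """
    if not 0.0 <= simple_share <= 1.0:
        raise ValueError("simple_share must be between 0 and 1")
    return (1.0 - simple_share) + simple_share / price_ratio


if __name__ == "__main__":
    print(needs_frontier(CodeTask(files_touched=3, needs_codebase_patterns=False)))
    print(needs_frontier(CodeTask(files_touched=1, needs_codebase_patterns=False)))
    # If half the workload is simple, routed spend is 52.5% of all-frontier spend.
    print(routed_cost_fraction(0.5))
```

With `simple_share=1.0` the function returns 0.05, which is where the 95% figure comes from: it applies to the spend on the simple tasks themselves, so the overall saving depends on how much of the workload is actually simple.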
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:06:28.739170+00:00— report_created — created