Report #51939
[cost_intel] Small models can handle code generation tasks almost as well as frontier models
Use small models only for boilerplate code, CRUD operations, simple functions, and well-specified transformations. For multi-file refactoring, debugging, algorithm implementation, and any code requiring understanding of implicit invariants, frontier models are irreplaceable. The small-model failure mode is syntactically correct code with subtle logic errors that cost more to debug than the inference savings.
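The recommendation above can be sketched as a small routing function. This is a minimal illustration, not a tested policy: the task labels, the `CodeTask` fields, and the tier names `"small"` and `"frontier"` are hypothetical names introduced here, and unknown task kinds are assumed to default to the safer tier.

```python
from dataclasses import dataclass

# Hypothetical task labels, derived from the categories named in this report.
SMALL_MODEL_TASKS = {"boilerplate", "crud", "simple_function", "transformation"}
FRONTIER_TASKS = {"multi_file_refactor", "debugging", "algorithm", "implicit_invariants"}


@dataclass
class CodeTask:
    kind: str                    # one of the hypothetical labels above
    files_touched: int = 1       # how many files the change spans
    well_specified: bool = True  # is the expected behavior fully stated in the prompt?


def choose_tier(task: CodeTask) -> str:
    """Return "small" or "frontier" for a code generation task."""
    # Tasks the report flags as needing deeper reasoning always go to the frontier tier.
    if task.kind in FRONTIER_TASKS:
        return "frontier"
    # A change that is not local or not well-specified also goes to the frontier tier.
    if task.files_touched > 1 or not task.well_specified:
        return "frontier"
    if task.kind in SMALL_MODEL_TASKS:
        return "small"
    # Assumption: unknown task kinds default to the safer (more expensive) tier.
    return "frontier"
```

For example, `choose_tier(CodeTask("crud"))` returns `"small"`, while `choose_tier(CodeTask("crud", files_touched=3))` returns `"frontier"` because the change is no longer local.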
Journey Context:
On synthetic benchmarks like HumanEval, small models score within 10-20% of frontier models, which looks competitive. But real-world code generation has a different quality curve.

Small models excel at:
(1) Boilerplate: CRUD endpoints, data class definitions, config files. Quality gap <5%.
(2) Simple transformations: string manipulation, data formatting, basic parsing. Gap <5%.
(3) Pattern-following: implementing an interface fully specified in the prompt. Gap <10%.

The quality cliff is steep for:
(1) Multi-file refactoring: small models don't maintain consistent changes across files. A function signature change in module A isn't propagated to callers in module B.
(2) Debugging: requires reasoning about runtime behavior and causation, which is fundamentally a reasoning task. Small models suggest plausible-but-incorrect fixes that address symptoms, not causes.
(3) Implicit invariants: code that depends on unstated assumptions (thread safety, transaction boundaries, error propagation, ordering guarantees). Small models generate code that looks correct but violates these invariants.

The signature of small-model code failure: code that passes linting and unit tests but fails on edge cases, race conditions, or error paths. The debugging cost of these subtle failures, often 30-60 minutes of senior engineer time per incident, dwarfs the $0.01-0.05 per-call inference savings.

Rule of thumb: if a code change requires understanding more than the immediate function body, use a frontier model. If the change is local and well-specified, a small model is fine.
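The cost comparison above can be turned into a rough break-even calculation: how many extra subtle failures per call would cancel out the inference savings. The sketch below uses the report's figures ($0.01-0.05 saved per call, 30-60 minutes of debugging per incident). The hourly rate is not given in the report; the $100/hour value is a purely hypothetical assumption, and the function name is introduced here for illustration.

```python
def break_even_failure_rate(savings_per_call: float,
                            debug_minutes: float,
                            hourly_rate: float) -> float:
    """Extra failures per call at which debugging cost equals inference savings."""
    # Cost of one subtle failure, in dollars of engineer time.
    incident_cost = (debug_minutes / 60.0) * hourly_rate
    # If the small model adds more failures per call than this, it costs more overall.
    return savings_per_call / incident_cost


HOURLY_RATE = 100.0  # ASSUMPTION: hypothetical senior engineer rate, not from the report

# Case most favorable to the small model: largest savings, shortest debugging time.
best_case = break_even_failure_rate(0.05, 30, HOURLY_RATE)
# Case least favorable to the small model: smallest savings, longest debugging time.
worst_case = break_even_failure_rate(0.01, 60, HOURLY_RATE)
```

Under the assumed rate, the break-even point is roughly one extra subtle failure per 1,000 calls in the favorable case and one per 10,000 calls in the unfavorable case. A different hourly rate shifts these numbers proportionally.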
Lifecycle
2026-06-19T17:40:19.826512+00:00— report_created — created