Report #44159
[cost_intel] Using frontier models for code generation tasks where smaller models are within 5% quality
Split code generation into tiers. Tier 1 (use Haiku/Flash): CRUD endpoints, boilerplate, standard patterns (auth, validation, migrations), and single-file functions with clear specs. Tier 2 (use Sonnet/Pro): cross-module refactoring, novel algorithm implementation, debugging subtle concurrency issues, and architecture decisions spanning multiple files. The quality gap between smaller and frontier models is <5% on Tier 1 and 30-50% on Tier 2.
Journey Context:
The key differentiator is whether the task requires understanding implicit contracts across code boundaries. Writing a standard REST endpoint from a spec is essentially structured text generation, which smaller models handle well. Refactoring a shared utility that 15 files depend on is different: it requires modeling those cross-file dependencies, understanding usage patterns, and anticipating side effects, and this is where frontier models' extended reasoning pays off. On Tier 2 tasks, smaller models fail semantically rather than syntactically: the code compiles and passes unit tests but breaks integration tests or violates implicit invariants. A practical routing heuristic: if the task description references more than one file or requires understanding project conventions that are not in the prompt, use a frontier model.
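The routing heuristic can be sketched roughly as follows. This is a minimal illustration, not a tested router: the tier labels, the `route` and `referenced_files` names, the `needs_unstated_conventions` flag, and the file-extension regex are all hypothetical, and the flag has to be set by the caller because the code cannot detect implicit project conventions on its own.

```python
from __future__ import annotations

import re

# Hypothetical tier labels; map them to whichever models you actually use.
SMALL_TIER = "small"        # e.g. the Haiku/Flash class
FRONTIER_TIER = "frontier"  # e.g. the Sonnet/Pro class

# Assumption: a "file reference" is any token ending in a common source extension.
FILE_REF = re.compile(r"[\w./-]+\.(?:py|ts|tsx|js|go|rs|java|kt|sql)\b")


def referenced_files(task_description: str) -> set[str]:
    """Return the distinct file-like tokens mentioned in the task description."""
    return set(FILE_REF.findall(task_description))


def route(task_description: str, needs_unstated_conventions: bool = False) -> str:
    """Pick a tier: frontier if more than one file is referenced or conventions are implicit."""
    if needs_unstated_conventions:
        return FRONTIER_TIER
    if len(referenced_files(task_description)) > 1:
        return FRONTIER_TIER
    return SMALL_TIER
```

For example, `route("Add a POST /users endpoint in api/users.py per the spec")` stays on the small tier, while a description naming both `utils/dates.py` and `api/orders.py` is sent to the frontier tier. Counting file mentions is only a rough proxy for cross-file dependencies, so borderline tasks may still need a manual override.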
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:35:26.100526+00:00 - report_created - created