Report #87973
[cost_intel] Single model tier for all code generation complexity levels
Tier code generation by task complexity. Use Haiku/Flash/GPT-4o-mini for boilerplate (CRUD endpoints, test scaffolding, migrations, docstrings, format conversion): quality is within 5% of frontier at 3-17x lower cost. Use Sonnet/GPT-4o for debugging, refactoring, and cross-module changes. The signature that a task is too complex for the small tier: small-model output passes syntax checks but fails integration tests due to subtle logic errors.
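A minimal sketch of routing by task type, assuming the caller already labels each request. The label strings, the `Tier` enum, and `pick_tier` are hypothetical names chosen for illustration; the report specifies the categories but not an implementation. Defaulting unknown task types to the frontier tier is an added conservative assumption, not something the report prescribes.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"        # e.g. Haiku / Flash / GPT-4o-mini
    FRONTIER = "frontier"  # e.g. Sonnet / GPT-4o


# Hypothetical labels for the boilerplate categories listed in the report.
BOILERPLATE_TASKS = {
    "crud_endpoint",
    "test_scaffolding",
    "migration",
    "docstring",
    "format_conversion",
}


def pick_tier(task_type: str) -> Tier:
    """Route on task type only; language and file type are deliberately ignored."""
    if task_type in BOILERPLATE_TASKS:
        return Tier.SMALL
    # Debugging, refactoring, cross-module changes, and anything unrecognized
    # go to the frontier tier (assumption: unknown work is treated as complex).
    return Tier.FRONTIER
```

For example, `pick_tier("migration")` selects the small tier, while `pick_tier("debugging")` selects the frontier tier.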
Journey Context:
The cost-quality curve for code has a clear inflection at one question: does this task require understanding existing code context? Boilerplate generation from a spec is essentially translation, so small models match frontier quality. Debugging, by contrast, requires forming a mental model of existing code, identifying the discrepancy between intended and actual behavior, and generating a targeted fix. Small models on debugging tasks produce code that looks correct (right syntax, reasonable variable names, plausible logic) but contains subtle errors: off-by-one, inverted conditions, missing edge cases, wrong variable scope. The developer time needed to find and fix these errors far exceeds the model cost savings. The signature: if small-model generations pass unit tests but fail integration tests, you have hit the complexity cliff. Practical tiering: route by task type (boilerplate to small, debugging to frontier) rather than by programming language or file type. A Python CRUD endpoint is boilerplate; a Python race condition fix is frontier work. The cost of a wrong generation is not the token cost; it is the 15-30 minutes of developer debugging time that a correct frontier-model generation would have saved.
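The trade-off described above can be sketched as a rough expected-cost comparison plus a check for the complexity-cliff signature. Both function names and every parameter are hypothetical; the error rates, token costs, and hourly rate you plug in are your own estimates, since the report gives only the 15-30 minute debugging range.

```python
def expected_cost_usd(
    token_cost_usd: float,
    p_subtle_error: float,
    debug_minutes: float,
    dev_rate_usd_per_hour: float,
) -> float:
    """Token cost plus the expected developer time spent fixing a wrong generation."""
    debug_cost = (debug_minutes / 60.0) * dev_rate_usd_per_hour
    return token_cost_usd + p_subtle_error * debug_cost


def hit_complexity_cliff(unit_tests_pass: bool, integration_tests_pass: bool) -> bool:
    """Signature from the report: unit tests pass, integration tests fail."""
    return unit_tests_pass and not integration_tests_pass
```

With purely illustrative inputs (a small model at $0.01 per generation and a 30% subtle-error rate, a frontier model at $0.10 and 5%, 20 minutes of debugging at $90/hour), the expected debugging cost dominates the token cost for both tiers, which is the point the paragraph makes. When `hit_complexity_cliff` returns true for small-tier output, that task type is a candidate for escalation to the frontier tier.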
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:15:03.823503+00:00— report_created — created