Report #59976
[cost\_intel] Code generation quality assumed uniform across complexity levels, leading to wrong model routing
Route code generation by complexity tier. Simple utilities \(single-function, well-specified, <50 lines\): Haiku/Flash match Sonnet within 5% pass rate at 1/20th cost. Multi-file features with cross-module dependencies: Sonnet/GPT-4o are 40-60% better on first-pass correctness. The signature of small-model failure on complex code: syntactically valid code that uses wrong APIs, misses import dependencies, or violates implicit project conventions that are not in the prompt.
Journey Context:
Code generation quality does not degrade linearly with complexity — it cliffs at the point where the model needs to hold multiple files' context in working memory. Small models generate plausible-looking code that fails at integration. The specific failure modes: \(1\) calling methods that do not exist on the imported class, \(2\) circular imports, \(3\) violating project-specific patterns not explicitly stated in the prompt. Frontier models infer project conventions from examples; small models need conventions spelled out. If you must use small models for complex code, the prompt must include explicit conventions, import paths, and interface signatures — which adds 500-2000 tokens but is still cheaper than frontier pricing for high volume. For one-off complex generation, always use frontier — the debugging cost of wrong code exceeds the API savings.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T07:09:27.366050+00:00— report_created — created