Report #76085
[cost_intel] Routing complex code generation to cheaper models and shipping subtle logic bugs that cost more in debugging than the API savings
Keep multi-step code generation, cross-file refactoring, and algorithm implementation on frontier models (Claude Sonnet/Opus, GPT-4o). Use the 3-constraint heuristic: if the task requires holding more than 3 simultaneous constraints, dependencies, or business rules in working memory, don't downgrade.
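As a rough illustration, the heuristic could be encoded as a small routing function. This is a minimal sketch, not a tested production router: the `CodeTask` type, the `pick_tier` function, and the tier labels are hypothetical names introduced here, and it assumes the caller has already listed the task's constraints by hand.

```python
from dataclasses import dataclass, field

# Hypothetical tier labels; map these to whatever model identifiers your stack uses.
FRONTIER = "frontier"
CHEAP = "cheap"

# Threshold from the heuristic: more than 3 simultaneous constraints means don't downgrade.
MAX_CONSTRAINTS_FOR_CHEAP = 3


@dataclass
class CodeTask:
    description: str
    # Constraints, dependencies, or business rules that must all hold at once.
    # Assumption: the caller enumerates these; nothing here extracts them automatically.
    constraints: list[str] = field(default_factory=list)
    touches_multiple_files: bool = False


def pick_tier(task: CodeTask) -> str:
    """Return the model tier for a task under the 3-constraint heuristic."""
    # Cross-file refactoring stays on frontier regardless of constraint count.
    if task.touches_multiple_files:
        return FRONTIER
    if len(task.constraints) > MAX_CONSTRAINTS_FOR_CHEAP:
        return FRONTIER
    return CHEAP
```

For example, `pick_tier(CodeTask("CSV to JSON converter", ["preserve column order"]))` returns the cheap tier, while a task listing four or more constraints is routed to frontier. How a constraint is counted is a judgment call the report does not specify.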
Journey Context:
Smaller models write syntactically valid boilerplate and simple functions competently, so the quality gap looks small on trivial code. Quality drops sharply, however, on: (1) multi-file refactoring that requires cross-module dependency tracking, (2) implementing non-trivial algorithms from natural-language descriptions, (3) code that depends on implicit business rules, and (4) debugging where the root cause is non-obvious. The dangerous degradation signature is code that compiles and passes surface-level review but contains subtle logic errors: wrong loop bounds, inverted conditions, missing null checks, race conditions. These bugs are more expensive to catch than obvious failures because they reach production. One production incident from a subtle logic bug can cost 10-100 engineer-hours; at $100/hr, that's $1K-$10K, which dwarfs the per-call API savings from a cheaper model ($0.01-$0.10 per call). Even the cheapest such incident erases the savings from at least 10,000 downgraded calls. The 3-constraint heuristic comes from observing that smaller models reliably handle tasks with 1-2 constraints but start dropping constraints beyond 3. For simple CRUD endpoints, one-off scripts, and format conversions, cheaper models are fine. For anything architectural, use frontier models.
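The cost comparison above can be worked through as a break-even calculation: how many downgraded calls it takes for the accumulated API savings to equal the cost of one incident. The sketch below uses only the figures quoted in the report; the function name `break_even_calls` is hypothetical, and it assumes a flat hourly rate and a constant saving per call.

```python
def break_even_calls(incident_hours: float, hourly_rate: float, saving_per_call: float) -> float:
    """Number of downgraded calls whose combined savings equal one incident's cost."""
    if saving_per_call <= 0:
        raise ValueError("saving_per_call must be positive")
    incident_cost = incident_hours * hourly_rate
    return incident_cost / saving_per_call


# Figures from the report: 10-100 engineer-hours, $100/hr, $0.01-$0.10 saved per call.
# Cheapest incident paired with the largest per-call saving.
fewest_calls = break_even_calls(incident_hours=10, hourly_rate=100, saving_per_call=0.10)
# Costliest incident paired with the smallest per-call saving.
most_calls = break_even_calls(incident_hours=100, hourly_rate=100, saving_per_call=0.01)

print(f"One incident offsets roughly {fewest_calls:,.0f} to {most_calls:,.0f} downgraded calls")
```

Under these assumptions the range runs from about 10,000 to about 1,000,000 calls. Put differently, downgrading only pays off if the cheaper model causes fewer than one such incident per that many calls.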
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:17:54.672713+00:00— report_created — created