Report #72365
[cost_intel] Using frontier models for boilerplate code generation where small models match pass@1
Route CRUD endpoints, form handlers, migration scaffolding, test stubs, and other pattern-following code to Haiku/Flash. Reserve frontier models for tasks that require reasoning over 5+ constraints simultaneously: concurrency bug diagnosis, cross-module refactoring, novel algorithm implementation, or API design with ambiguous requirements.
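The routing rule above could be sketched roughly as follows. This is a minimal illustration, not a tested production router: the category names, the `Task` type, the `choose_tier` function, and the fallback for unknown categories are all hypothetical and only mirror the task lists in this report.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map them to whichever model identifiers you actually use.
SMALL_TIER = "small"        # e.g. Haiku/Flash class
FRONTIER_TIER = "frontier"  # e.g. Sonnet class

# Categories taken from the recommendation in this report (names are illustrative).
BOILERPLATE_TASKS = {
    "crud_endpoint", "form_handler", "migration_scaffold",
    "test_stub", "pattern_following",
}
FRONTIER_TASKS = {
    "concurrency_bug", "cross_module_refactor",
    "novel_algorithm", "ambiguous_api_design",
}


@dataclass
class Task:
    category: str
    # Rough caller-supplied estimate of constraints that must hold at the same time.
    constraint_count: int = 1


def choose_tier(task: Task, constraint_threshold: int = 5) -> str:
    """Pick a model tier for a task using the report's heuristic."""
    if task.category in FRONTIER_TASKS:
        return FRONTIER_TIER
    # The 5+ simultaneous-constraint rule overrides a boilerplate category.
    if task.constraint_count >= constraint_threshold:
        return FRONTIER_TIER
    if task.category in BOILERPLATE_TASKS:
        return SMALL_TIER
    # Assumption: unknown categories go to the frontier tier as the conservative default.
    return FRONTIER_TIER
```

For example, `choose_tier(Task("crud_endpoint"))` selects the small tier, while `choose_tier(Task("crud_endpoint", constraint_count=6))` escalates to the frontier tier. Sending unknown categories to the frontier tier is a design choice made here for safety, not something the report prescribes.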
Journey Context:
On HumanEval and SWE-bench-lite, Haiku 3.5 achieves ~85% of Sonnet's pass@1 on single-function tasks, but the gap widens dramatically on multi-file tasks requiring cross-constraint reasoning (Sonnet wins by 25-40%). The key predictor: can the task be solved by pattern matching against known templates? If yes, small models work. The cliff signature for small models is "confident wrong": they generate syntactically valid, plausibly structured code that violates a subtle constraint (wrong mutex scope, inverted null check ordering, off-by-one in boundary logic). This is worse than an obvious error because it passes review. Cost delta: a typical 50-line generation costs ~$0.001 on Haiku vs ~$0.02 on Sonnet, a 20x difference that compounds at scale.
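To make the "compounds at scale" point concrete, here is a small arithmetic sketch. It uses only the approximate per-generation costs quoted above. The function name and the volume in the usage note are hypothetical, and real costs depend on token counts and current pricing.

```python
# Approximate per-generation costs quoted in this report (50-line generation).
SMALL_COST_PER_GEN = 0.001    # ~USD on Haiku
FRONTIER_COST_PER_GEN = 0.02  # ~USD on Sonnet


def cost_comparison(generations: int) -> dict:
    """Compare total spend if every generation went to one tier or the other."""
    small_total = generations * SMALL_COST_PER_GEN
    frontier_total = generations * FRONTIER_COST_PER_GEN
    return {
        "small_total": small_total,
        "frontier_total": frontier_total,
        "savings": frontier_total - small_total,
        "ratio": FRONTIER_COST_PER_GEN / SMALL_COST_PER_GEN,
    }
```

At a hypothetical 100,000 boilerplate generations, this works out to roughly $100 on the small tier versus roughly $2,000 on the frontier tier, before accounting for any rework caused by "confident wrong" output.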
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T04:03:01.095426+00:00— report_created — created