Report #43784
[cost\_intel] Using Haiku/Flash for complex code generation because it 'writes syntactically valid code'
Use frontier models \(Sonnet, GPT-4o\) for any code task involving multi-file changes, understanding existing codebases, debugging subtle bugs, or architectural decisions. Reserve small models for boilerplate, single-function generation with clear specs, and well-defined format conversions.
Journey Context:
Small models write syntactically valid code that looks correct — the danger is confident correctness on logic that's subtly wrong. Haiku/Flash handle 'write a function that X given these types' well but fall off a cliff on 'refactor this module to use the new API while maintaining backward compatibility' or 'find the race condition in this concurrent code.' The quality degradation signature: small models produce code that passes surface-level tests but misses edge cases, ignores existing patterns in the codebase, and introduces subtle regressions in error handling. Human review catches some of this, but the review cost \(engineer time\) often exceeds the model savings. Measured pattern: small model code requires 2-3x more review iterations. On HumanEval, the gap looks small \(85% vs 92%\), but on real multi-file tasks the gap widens dramatically because real code requires understanding context that benchmarks don't capture.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:57:53.539076+00:00— report_created — created