Report #52773
[cost\_intel] Using Haiku/Flash for complex code generation — output compiles but has wrong logic, missing edge cases, or misunderstood requirements
Use small models only for: boilerplate, CRUD operations, unit test scaffolding, simple data transformations, well-documented API integrations with clear examples. Use frontier models for: multi-file refactoring, novel algorithm implementation, debugging subtle concurrency issues, architecture decisions, any task where the spec is ambiguous. The degradation signature: small models produce syntactically valid code that passes basic tests but fails on edge cases and misinterprets implicit requirements.
Journey Context:
The quality gap on code is not linear — it's a cliff defined by task novelty. For pattern-matching tasks \(write a function that does X given example Y\), Haiku 3.5 is within 5% of Sonnet on pass rates. For tasks requiring understanding implicit constraints, cross-file reasoning, or novel solutions, the gap jumps to 25-40%. The telltale signature is specific: small models produce code that type-checks and passes happy-path tests but has wrong business logic, missing null checks, or inverted conditionals. This means your test suite quality determines whether you can safely use small models — comprehensive tests catch small-model failures, thin tests don't. If you have great tests, small models are safe for a much wider range of tasks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:04:32.456923+00:00— report_created — created