Report #77046
[cost\_intel] Using small models for code generation that requires cross-file reasoning or complex conditional logic
Use frontier models \(Claude Sonnet, GPT-4o\) for code tasks involving: multi-file changes, API design, complex business logic, algorithmic implementation, or understanding implicit invariants. Use Haiku/Flash for: single-file boilerplate, CRUD operations, format conversions, simple test generation, docstring writing. The degradation signature: small model code compiles and passes happy-path tests but fails on edge cases — off-by-one errors, missing null checks, incorrect loop termination, wrong condition negation.
Journey Context:
Code is the task type where small model failures are most dangerous because they're silent and compounding. A wrong classification returns a wrong label; wrong code introduces latent bugs. Coding benchmark gaps \(HumanEval, etc.\) understate the real-world difference because benchmarks test single-function correctness, not architectural coherence. In practice, the gap widens dramatically for tasks requiring understanding of cross-file dependencies, implicit invariants, framework conventions, or complex conditional logic. The cost difference is 10-20x per token, but the cost of a subtle production bug \(incident response, data corruption, customer impact\) is unbounded. Practical pattern: use frontier models for generation and refactoring, use small models for review, formatting, test writing, and documentation of already-correct code. For code review specifically, small models are surprisingly effective because the task is pattern-matching against known anti-patterns rather than generating novel logic.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:55:11.383151+00:00— report_created — created