Report #61709
[cost\_intel] Code generation quality cliff on smaller models — syntactically correct but logically wrong
For code generation, smaller models degrade at a cliff, not along a slope: output stays syntactically valid while its logic becomes wrong. Use frontier models for any code handling money, authentication, data integrity, or complex business logic. Use small models only for boilerplate, CRUD operations, format conversions, and well-specified transformations with clear input-output examples.
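The routing rule above can be sketched as a small lookup. This is a minimal illustration, not a tested policy: the tag strings, the `Tier` enum, and `choose_tier` are hypothetical names, and defaulting unknown or untagged work to the frontier tier is a conservative assumption that the report does not state.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"
    FRONTIER = "frontier"


# Categories come from the report; the tag spellings are hypothetical.
FRONTIER_TAGS = frozenset(
    {"money", "authentication", "data_integrity", "business_logic"}
)
SMALL_OK_TAGS = frozenset(
    {"boilerplate", "crud", "format_conversion", "specified_transformation"}
)


def choose_tier(tags):
    """Pick a model tier for a code generation task described by tags."""
    tags = set(tags)
    # Any high-risk category forces the frontier tier, even in a mixed task.
    if tags & FRONTIER_TAGS:
        return Tier.FRONTIER
    # Only tasks made up entirely of known low-risk categories go small.
    if tags and tags <= SMALL_OK_TAGS:
        return Tier.SMALL
    # Assumption: unknown or untagged work is treated as high-risk.
    return Tier.FRONTIER
```

A task tagged `["crud", "money"]` goes to the frontier tier under this sketch, because a single high-risk category outweighs the low-risk ones.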
Journey Context:
The dangerous pattern: code generated by a small model looks correct in code review and passes superficial tests. The errors are subtle: off-by-one mistakes in loops, missing null checks, incorrect error-handling paths, wrong variable capture in closures, and inverted conditional logic. They slip past unit tests that do not cover edge cases. Frontier models cost roughly 10-17x more, but one production incident from a subtle logic error can cost more than a year of frontier model API spend. The reliable heuristic: if the specification requires understanding WHY (business invariants, security properties, data consistency guarantees), use a frontier model. If it only requires understanding WHAT (format conversion, template instantiation, boilerplate scaffolding), a small model suffices. A practical mitigation when using small models: generate with the small model, then have the frontier model review the output specifically for logic errors. This is cheaper than frontier generation because a review produces far fewer output tokens than writing the code itself.
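The cost argument for generate-then-review can be checked with a short calculation. Every number below is a hypothetical assumption chosen for illustration: the per-token prices are in arbitrary cost units, the small model is assumed to be 10x cheaper (the low end of the reported range), output tokens are assumed to cost more than input tokens, and the token counts are invented. Substitute real prices and measured token counts before drawing conclusions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price per 1,000 tokens in arbitrary cost units (hypothetical values)."""
    input_per_1k: float
    output_per_1k: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        return (input_tokens * self.input_per_1k
                + output_tokens * self.output_per_1k) / 1000


def compare_costs(frontier, small, spec_tokens, code_tokens, review_tokens):
    """Compare frontier generation with small generation plus frontier review."""
    # Frontier writes the code: spec as input, full code as output.
    frontier_generation = frontier.cost(spec_tokens, code_tokens)
    # Small model writes the code from the same spec.
    small_generation = small.cost(spec_tokens, code_tokens)
    # Frontier reviews: spec and code are input, only a short review is output.
    frontier_review = frontier.cost(spec_tokens + code_tokens, review_tokens)
    return {
        "frontier_generation": frontier_generation,
        "small_plus_review": small_generation + frontier_review,
    }


# Hypothetical prices: small model assumed 10x cheaper than frontier.
FRONTIER = Pricing(input_per_1k=3.0, output_per_1k=15.0)
SMALL = Pricing(input_per_1k=0.3, output_per_1k=1.5)

costs = compare_costs(FRONTIER, SMALL,
                      spec_tokens=500, code_tokens=2000, review_tokens=300)
print(costs)
```

Under these assumed numbers the review pipeline costs about half as much as frontier generation. The saving depends on the review staying short relative to the code, and it shrinks if output tokens are not priced well above input tokens.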
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:04:07.251331+00:00— report_created — created