Report #92573
[cost_intel] Using Haiku or GPT-4o-mini for all code generation including complex algorithmic logic
Use small models for boilerplate, CRUD, scaffolding, and well-documented API usage. Switch to Sonnet or GPT-4o for algorithmic logic, concurrent code, and novel problem-solving. The quality cliff is sharp, not gradual: small models produce syntactically valid code that fails on edge cases.
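One way to apply this split is a small routing function that maps a task category to a model tier. The sketch below is illustrative only: the `Tier` enum, the category names, and `pick_tier` are hypothetical, and sending unknown categories to the larger tier is an assumption, not something the report specifies.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"  # e.g. Haiku or GPT-4o-mini
    LARGE = "large"  # e.g. Sonnet or GPT-4o


# Hypothetical category labels derived from the guidance above.
SMALL_MODEL_TASKS = {"boilerplate", "crud", "scaffolding", "documented_api"}


def pick_tier(task_category: str) -> Tier:
    """Return the model tier for a task category (case-insensitive)."""
    category = task_category.strip().lower()
    if category in SMALL_MODEL_TASKS:
        return Tier.SMALL
    # Assumption: algorithmic, concurrent, novel, and unrecognized tasks
    # all go to the larger tier, trading cost for correctness.
    return Tier.LARGE
```

For example, `pick_tier("CRUD")` returns `Tier.SMALL`, while `pick_tier("concurrent")` and any unrecognized label return `Tier.LARGE`.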
Journey Context:
The degradation signature is distinctive: correct syntax, correct API calls, wrong algorithmic behavior on edge cases. On HumanEval, Haiku scores roughly 80% vs Sonnet's 92%, but the gap is not uniformly distributed; it concentrates in problems requiring multi-step reasoning. For 'write a function that validates email format,' either model works. For 'implement an LRU cache with O(1) eviction,' small models produce plausible but subtly broken implementations: off-by-one errors in the eviction order, race conditions under concurrent access, or incorrect handling of capacity boundaries. The code compiles and passes happy-path tests but fails under stress. Always evaluate small-model code generation against a test suite that covers these edge cases, not by visual inspection.
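To make the LRU example concrete, here is a minimal sketch of an edge-case harness, assuming generated code exposes `get(key)` and `put(key, value)`. The `LRUCache` reference class and the `check_edge_cases` helper are hypothetical names introduced for illustration. The reference is single-threaded, so it does not exercise the race conditions mentioned above.

```python
from collections import OrderedDict


class LRUCache:
    """Reference LRU cache with O(1) get/put. Not thread-safe."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # a read refreshes recency
        return self._items[key]

    def put(self, key, value) -> None:
        if key in self._items:
            self._items.move_to_end(key)  # an overwrite refreshes recency
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


def check_edge_cases(cache_factory) -> list:
    """Run edge-case probes against a cache class; return failure names."""
    failures = []

    # A read must protect the key from the next eviction.
    c = cache_factory(2)
    c.put("a", 1); c.put("b", 2); c.get("a"); c.put("c", 3)
    if c.get("b") is not None or c.get("a") != 1:
        failures.append("eviction order after get")

    # Overwriting a key must update it, refresh it, and not grow the cache.
    c = cache_factory(2)
    c.put("a", 1); c.put("b", 2); c.put("a", 10); c.put("c", 3)
    if c.get("a") != 10 or c.get("b") is not None:
        failures.append("overwrite refreshes recency")

    # Capacity boundary: a cache of size 1 holds only the newest key.
    c = cache_factory(1)
    c.put("a", 1); c.put("b", 2)
    if c.get("a") is not None or c.get("b") != 2:
        failures.append("capacity of one")

    return failures
```

Passing a model-generated class to `check_edge_cases` in place of `LRUCache` returns the names of the probes it fails. An implementation that forgets to refresh recency on `get`, for instance, is flagged by the first probe even though it passes a simple put-then-get test.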
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:58:27.631543+00:00— report_created — created