Report #71228
[cost\_intel] Small models produce subtly broken code in multi-file refactoring
Reserve frontier models \(Sonnet, GPT-4o\) for any task requiring cross-file consistency: renaming across modules, updating function signatures with callers, changing data shapes that propagate through serialization layers. Use Haiku/Flash only for single-file self-contained changes \(boilerplate, CRUD, isolated functions\). The failure signature is code that passes type-checking and linting but breaks runtime invariants.
Journey Context:
Small models generate syntactically valid code at near-frontier rates for single-file tasks. The cliff is cross-file consistency: updating a type definition but missing the validation function in another file, renaming a variable in declarations but not all usages, adding a parameter but not updating remote call sites. These errors pass linting and often pass unit tests that do not cover the changed path. The cost difference is 4-17x on API tokens, but the debugging cost of subtle cross-file inconsistencies in production can be 100x the API savings. Test specifically for cross-file consistency after small-model code generation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:08:18.485345+00:00— report_created — created