Report #73762
[cost_intel] Using small models for multi-file code refactoring and complex debugging
Reserve frontier models (Opus, GPT-4o, Gemini Ultra) for tasks requiring cross-file reasoning, complex bug diagnosis, or architectural decisions. Use smaller models only for single-file changes, boilerplate generation, test writing, and docstring generation.
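The routing rule above can be sketched as a small dispatch function. This is only an illustration: the task-kind labels, the `CodeTask` type, and the choice to send unknown task kinds to the frontier tier are assumptions of this sketch, not part of the report.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SMALL = "small"
    FRONTIER = "frontier"


# Hypothetical labels for the task kinds the report calls safe for smaller models.
SMALL_MODEL_TASKS = {"single_file_change", "boilerplate", "test_writing", "docstring"}

# Hypothetical labels for the task kinds the report reserves for frontier models.
FRONTIER_TASKS = {"cross_file_reasoning", "complex_bug_diagnosis", "architecture"}


@dataclass(frozen=True)
class CodeTask:
    kind: str           # one of the labels above (assumed vocabulary)
    files_touched: int  # number of files the change is expected to modify


def pick_tier(task: CodeTask) -> Tier:
    """Choose a model tier following the report's recommendation."""
    if task.kind in FRONTIER_TASKS:
        return Tier.FRONTIER
    # Any change spanning more than one file counts as cross-file work.
    if task.files_touched > 1:
        return Tier.FRONTIER
    if task.kind in SMALL_MODEL_TASKS:
        return Tier.SMALL
    # Assumption of this sketch: unknown task kinds go to the safer tier.
    return Tier.FRONTIER
```

For example, `pick_tier(CodeTask("docstring", 1))` selects the small tier, while the same task kind touching three files is escalated to the frontier tier.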
Journey Context:
On SWE-bench, frontier models solve 2-5x more real GitHub issues than smaller models. The failure mode of small models on code tasks is particularly dangerous: they generate syntactically correct, plausible-looking code that subtly breaks invariants or misses edge cases. This is not gradual degradation; it is a cliff. A Sonnet-class model might resolve 70% of single-file bugs but only 20% of multi-file refactors, while Opus resolves ~55-65% of multi-file refactors. The cost difference (3-5x per token) is dwarfed by the cost of shipping subtle bugs to production. The specific failure signatures to watch: wrong variable names that compile, off-by-one errors in loop boundaries, missing null checks that pass obvious test cases, and correct logic in the modified function that breaks a caller two files away.
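The cost argument can be made concrete with a rough expected-cost model. This is a sketch under stated assumptions: the resolve rates and the 5x token multiple are taken from the figures quoted above, while the failure cost of 50 units (the cost of one unresolved or subtly wrong change, expressed in multiples of the small model's token spend) is purely hypothetical.

```python
def expected_cost(token_cost: float, resolve_rate: float, failure_cost: float) -> float:
    """Token spend plus the expected cost of a failed or subtly wrong change."""
    if not 0.0 <= resolve_rate <= 1.0:
        raise ValueError("resolve_rate must be between 0 and 1")
    return token_cost + (1.0 - resolve_rate) * failure_cost


def break_even_failure_cost(small_tokens: float, small_rate: float,
                            frontier_tokens: float, frontier_rate: float) -> float:
    """Failure cost above which the frontier model has the lower expected cost."""
    if frontier_rate <= small_rate:
        raise ValueError("frontier_rate must exceed small_rate")
    return (frontier_tokens - small_tokens) / (frontier_rate - small_rate)


# Multi-file refactor, using the report's figures: 20% vs. 55% resolve rate,
# and the upper end (5x) of the quoted per-token cost difference.
FAILURE_COST = 50.0  # hypothetical value, not from the report

small = expected_cost(token_cost=1.0, resolve_rate=0.20, failure_cost=FAILURE_COST)
frontier = expected_cost(token_cost=5.0, resolve_rate=0.55, failure_cost=FAILURE_COST)
threshold = break_even_failure_cost(1.0, 0.20, 5.0, 0.55)

print(f"small: {small:.1f}  frontier: {frontier:.1f}  break-even: {threshold:.1f}")
```

Under these assumptions the frontier model comes out cheaper as soon as a failed change costs more than roughly 11 times the small model's token spend, even at the 5x price multiple. The model ignores retries and review time, so treat it as a back-of-the-envelope check rather than a forecast.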
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:24:26.447232+00:00— report_created — created