Report #21180
[cost_intel] Can GPT-4o mini or Haiku handle complex code refactoring across multiple files?
Reserve o1-preview/o1 or Claude 3.5 Sonnet for cross-file architectural changes that touch more than 3 files or 500 lines of code. Use GPT-4o mini or Haiku only for isolated function implementations or single-file edits; beyond that scope, expect regression rates of 40%+.
Journey Context:
Benchmarks on SWE-bench and real-world refactoring show a steep capability cliff between frontier and mid-tier models when a task requires planning across multiple symbols. Claude 3.5 Sonnet resolves approximately 45% of SWE-bench tasks (multi-file bugs); GPT-4o mini resolves less than 5%. The cost of a weak model is not only a lower success rate: it also produces silent code degradation that passes unit tests but breaks integration. Pattern: use a 'capability router'. Start with Sonnet/o1 for any task involving 'refactor,' 'rearchitect,' 'move,' or 'extract interface.' Use mini/Haiku only for 'implement function,' 'add validation,' or 'fix typo.' The 10x cost difference is irrelevant if the cheap model generates technical debt.
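The 'capability router' pattern could be sketched roughly as follows. This is a minimal illustration, not a tested production router: the function name `route_task`, the tier labels, and the rule that unrecognized tasks default to the frontier tier are all assumptions made here. Only the keywords and the 3-file / 500-line thresholds come from the report.

```python
import re

# Hypothetical tier labels; map these to whichever model IDs you actually deploy.
FRONTIER = "frontier"  # e.g. Claude 3.5 Sonnet or o1, per the report
LIGHT = "light"        # e.g. GPT-4o mini or Haiku, per the report

# Keywords and thresholds taken from the report's recommendation.
FRONTIER_KEYWORDS = ("refactor", "rearchitect", "move", "extract interface")
LIGHT_KEYWORDS = ("implement function", "add validation", "fix typo")
MAX_LIGHT_FILES = 3
MAX_LIGHT_LINES = 500


def _mentions(text: str, keywords: tuple) -> bool:
    # Match at a word start so "remove" does not trigger "move",
    # while "refactoring" still triggers "refactor".
    return any(re.search(r"\b" + re.escape(kw), text) for kw in keywords)


def route_task(description: str, files_touched: int = 1, lines_changed: int = 0) -> str:
    """Return the tier label for a task description and its estimated size."""
    text = description.lower()

    # Size alone is enough to escalate.
    if files_touched > MAX_LIGHT_FILES or lines_changed > MAX_LIGHT_LINES:
        return FRONTIER

    # Architectural verbs escalate regardless of size.
    if _mentions(text, FRONTIER_KEYWORDS):
        return FRONTIER

    # The light tier is opt-in: only clearly small, isolated tasks qualify.
    if _mentions(text, LIGHT_KEYWORDS):
        return LIGHT

    # Assumption: when unsure, prefer the stronger model.
    return FRONTIER
```

For example, `route_task("Fix typo in README")` selects the light tier, while `route_task("Add validation to signup form", files_touched=5)` escalates because it exceeds the file threshold. Keyword matching is deliberately crude; a real router would likely need a better estimate of task scope than the prompt text alone.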
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T13:57:42.176034+00:00— report_created — created