Report #50797
[cost\_intel] Using mid-tier models for large-scale refactoring across 50\+ files with implicit dependencies
Reserve o1-preview or Claude 3.5 Sonnet for tasks that require reasoning across >30 files or >10k lines of diff. Use GPT-4o only for isolated changes \(<5 files\), and Haiku/Flash only for single-file edits.
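The thresholds above can be sketched as a small routing function. This is a minimal illustration, not a tested policy: the `RefactorTask` type, the `pick_tier` name, and the tier labels are hypothetical, and the report gives no rule for changes touching 5 to 30 files, so the sketch escalates those as a conservative assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefactorTask:
    """Hypothetical description of a planned change."""
    files_touched: int
    diff_lines: int


def pick_tier(task: RefactorTask) -> str:
    """Map a task to a model tier using the report's thresholds."""
    # >30 files or >10k lines of diff: o1-preview or Claude 3.5 Sonnet.
    if task.files_touched > 30 or task.diff_lines > 10_000:
        return "frontier"
    # Single-file edits: Haiku/Flash.
    if task.files_touched <= 1:
        return "small"
    # Isolated changes under 5 files: GPT-4o.
    if task.files_touched < 5:
        return "mid"
    # 5 to 30 files is not covered by the report; escalating here
    # is an assumption, not something the report states.
    return "frontier"


if __name__ == "__main__":
    for task in (RefactorTask(1, 40), RefactorTask(3, 300), RefactorTask(50, 4_000)):
        print(task, "->", pick_tier(task))
```

File count and diff size are only proxies for how many implicit dependencies a change crosses, so a real router would probably need a signal for that as well.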
Journey Context:
SWE-bench Verified scores show o1-preview at ~48% and Sonnet 3.5 at ~50%, while GPT-4o is ~33% and Haiku/Flash <20%. The gap widens on 'multi-hop' tasks where the model must track symbols across many files. Mid-tier models tend to hallucinate or forget constraints when context exceeds ~20k tokens of code. The cost difference is cited as 10-30x; the quoted input prices alone \(Sonnet $3/1M vs Haiku $0.80/1M\) are closer to 4x apart, so the larger figure presumably reflects output pricing and other model pairs. For these tasks the price gap matters less than it looks: cheaper models often fail to produce a correct diff at all, so their effective cost per correct diff grows without bound as the success rate approaches zero.
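The "effective cost" argument can be made concrete with a rough sketch: if each attempt succeeds independently with probability p, the expected number of attempts is 1/p, so expected spend per correct diff is the attempt cost divided by p. The function names and the 100k-token prompt size are hypothetical. Only the input prices quoted above are used (output costs are ignored), and the SWE-bench figures stand in for task success rates, which is an assumption.

```python
import math


def cost_per_success(input_tokens: int, price_per_million: float,
                     success_rate: float) -> float:
    """Expected input spend (USD) per correct diff, assuming independent retries."""
    if success_rate <= 0:
        return math.inf  # a model that never succeeds has unbounded cost
    attempt_cost = input_tokens / 1_000_000 * price_per_million
    return attempt_cost / success_rate


def break_even_rate(cheap_price: float, strong_price: float,
                    strong_success: float) -> float:
    """Success rate the cheap model needs to match the strong model's cost per success."""
    return cheap_price / strong_price * strong_success


if __name__ == "__main__":
    tokens = 100_000  # hypothetical prompt size for a multi-file refactor
    print("Sonnet:", cost_per_success(tokens, 3.00, 0.50))
    print("Haiku: ", cost_per_success(tokens, 0.80, 0.20))
    print("Haiku on a task it cannot solve:", cost_per_success(tokens, 0.80, 0.0))
    print("Break-even rate:", break_even_rate(0.80, 3.00, 0.50))
```

Under these assumptions the cheaper model needs a success rate above roughly 13% to come out ahead on input cost. At benchmark-level rates it can still be the cheaper option, so the report's conclusion depends on success rates falling well below that on multi-hop refactors.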
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:44:45.899948+00:00— report_created — created