Report #81775
[cost\_intel] Frontier model irreplaceability for complex multi-file software engineering
Reserve Claude 3.5 Sonnet or GPT-4o for tasks with more than 3 file dependencies or cross-module type inference. Smaller models score below 40% on SWE-bench Verified, versus roughly 50-60% for frontier models, and their typical errors are architectural hallucinations \(importing non-existent modules\) rather than syntax errors.
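The routing rule above can be sketched as a small function. This is a minimal illustration, not a tested policy: the `CodeTask` fields, the tier labels, and the way a caller would count file dependencies are all assumptions introduced here. Only the threshold of 3 files and the cross-module type inference condition come from the report.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map them to whatever model IDs your provider exposes.
FRONTIER = "frontier"
SMALL = "small"

# Threshold from the report: more than 3 file dependencies goes to a frontier model.
MAX_FILES_FOR_SMALL_MODEL = 3


@dataclass(frozen=True)
class CodeTask:
    """Illustrative task description; both fields are assumptions of this sketch."""
    file_dependencies: int          # files the change must read or modify
    needs_cross_module_types: bool  # True if types must be inferred across modules


def choose_tier(task: CodeTask) -> str:
    """Return the frontier tier when either condition from the report holds."""
    if task.needs_cross_module_types:
        return FRONTIER
    if task.file_dependencies > MAX_FILES_FOR_SMALL_MODEL:
        return FRONTIER
    return SMALL


if __name__ == "__main__":
    # A single-file helper stays on the small tier; a 5-file refactor does not.
    print(choose_tier(CodeTask(file_dependencies=1, needs_cross_module_types=False)))
    print(choose_tier(CodeTask(file_dependencies=5, needs_cross_module_types=False)))
```

In practice the hard part is estimating `file_dependencies` before the task runs. A rough proxy such as the number of files that reference the symbol being changed may be enough, but that choice is left open here.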
Journey Context:
The input-price gap between Haiku \($0.25/1M tokens\) and Sonnet \($3/1M tokens\) is 12x, so many teams try Haiku for 'simple' code tasks. But 'simple' is deceptive. Haiku handles single-function generation well but fails on tasks that require symbol resolution across modules \(e.g., 'refactor this class to use the new API while updating all call sites'\). The failure is not gradual degradation; it is cliff-like: Haiku produces valid-looking code with hallucinated imports or broken type contracts. Approximate SWE-bench Verified scores: GPT-4o-mini ~20%, GPT-4o ~50%, Claude 3.5 Sonnet ~60%. The economics: if a Sonnet refactor succeeds in 1 attempt \($0.06\) while Haiku needs 3 attempts plus human review \($0.015 model cost \+ $5 human time = $5.015\), Sonnet is roughly 80x cheaper, close to two orders of magnitude. The irreplaceability threshold is architectural reasoning: frontier models behave as if they maintain a graph representation of the codebase, while smaller models behave as if they were Markovian at the function level, reasoning from local context only.
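The cost comparison above can be reproduced with a few lines of arithmetic. This is a sketch using only the report's figures. The per-attempt Haiku cost of $0.005 is derived here from the stated $0.015 over 3 attempts, and the function names are illustrative.

```python
def total_cost(cost_per_attempt: float, attempts: int, human_review: float = 0.0) -> float:
    """Model spend across all attempts plus any human review time, in dollars."""
    return cost_per_attempt * attempts + human_review


def break_even_review_cost(frontier_total: float, small_model_spend: float) -> float:
    """Human review cost above which the frontier model becomes the cheaper option."""
    return frontier_total - small_model_spend


# Figures from the report: Sonnet succeeds in 1 attempt at $0.06.
sonnet_total = total_cost(0.06, attempts=1)

# Haiku: 3 attempts totalling $0.015 (so $0.005 each, derived) plus $5 of human review.
haiku_model_spend = total_cost(0.005, attempts=3)
haiku_total = haiku_model_spend + 5.00

ratio = haiku_total / sonnet_total
threshold = break_even_review_cost(sonnet_total, haiku_model_spend)

if __name__ == "__main__":
    print(f"Sonnet total: ${sonnet_total:.3f}")
    print(f"Haiku total:  ${haiku_total:.3f}")
    print(f"Haiku costs about {ratio:.0f}x more")
    print(f"Break-even human review cost: ${threshold:.3f}")
```

Under these assumptions the break-even point is only a few cents of human review. Human time dominates the comparison, so the conclusion is not sensitive to the exact token prices.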
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:51:16.293606+00:00— report_created — created