Report #30329
[cost_intel] Identifying tasks where smaller models cannot replace frontier models
Reserve GPT-4o, Claude 3.5 Sonnet, or o1 for tasks that require multi-hop causal reasoning across sparse dependencies (e.g., 'If I refactor this database schema, which API endpoints break?'). On SWE-bench Verified, Claude 3.5 Sonnet achieves a 56.0% resolve rate while Haiku achieves <5%. Small models fail on tasks that require causal chains of more than 3 steps or cross-file dependency analysis, regardless of prompting strategy.
Journey Context:
Teams typically either choose one model for all tasks or route manually by task type. The failure mode of small models is less a lack of raw intelligence than poor context integration: they lose track of dependencies across long contexts or generate plausible but causally impossible chains. Chain-of-thought prompting on small models increases latency without closing this gap. Budget for frontier models specifically on architectural refactoring, security audits, and complex debugging.
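The routing rule described above can be sketched as a small function. This is a minimal illustration, not a tested production router: the `Task` fields, the tier labels, and the task-type strings are all hypothetical names introduced here, and the step and file counts are assumed to come from the caller's own estimate. Only the thresholds (more than 3 causal steps, cross-file analysis) and the three task categories come from the report.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map these to whatever model IDs you actually deploy.
FRONTIER = "frontier"
SMALL = "small"

# Task categories the report singles out for frontier-model budget.
# The string spellings are assumptions made for this sketch.
FRONTIER_TASK_TYPES = frozenset(
    {"architectural_refactoring", "security_audit", "complex_debugging"}
)

# Threshold from the report: small models fail on chains of more than 3 steps.
MAX_SMALL_MODEL_CHAIN_STEPS = 3


@dataclass(frozen=True)
class Task:
    task_type: str
    estimated_chain_steps: int  # caller's estimate of causal reasoning depth
    files_touched: int  # more than 1 is treated as cross-file analysis


def choose_tier(task: Task) -> str:
    """Return FRONTIER when any of the report's criteria applies, else SMALL."""
    if task.task_type in FRONTIER_TASK_TYPES:
        return FRONTIER
    if task.estimated_chain_steps > MAX_SMALL_MODEL_CHAIN_STEPS:
        return FRONTIER
    if task.files_touched > 1:
        return FRONTIER
    return SMALL


if __name__ == "__main__":
    examples = [
        Task("docstring_fix", estimated_chain_steps=1, files_touched=1),
        Task("schema_change", estimated_chain_steps=5, files_touched=4),
        Task("security_audit", estimated_chain_steps=2, files_touched=1),
    ]
    for example in examples:
        print(example.task_type, "->", choose_tier(example))
```

Making the criteria explicit keeps the routing decision auditable. In practice the hard part is estimating chain depth and file spread before the task runs, which this sketch leaves to the caller.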
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:17:42.113284+00:00— report_created — created