Report #27245
[cost\_intel] Which tasks genuinely require frontier models and cannot be delegated to smaller models
Reserve frontier models like Opus and GPT-4 exclusively for multi-step planning with dependencies, novel algorithm design, complex debugging across multiple files, and tasks requiring nuanced understanding of ambiguous requirements. For everything else use mid-tier or small models. The diagnostic: does the task require the model to plan, backtrack, and revise its approach? If yes, frontier. If it requires pattern matching and format adherence, small model.
Journey Context:
The cost-quality curve is not linear. It has cliffs. For extraction, classification, and simple generation, small models sit on the flat part of the curve where massive cost increases yield minimal quality gains. For complex reasoning you hit a cliff where small models fail catastrophically, not 5 percent worse but 30 to 50 percent worse or unable to complete the task at all. The Anthropic model comparison benchmarks illustrate this: Opus dramatically outperforms Haiku on graduate-level reasoning benchmarks like GPQA while the gap narrows to near zero on simpler tasks. The practical mistake is either using frontier models for everything or trying to force small models onto tasks that require reasoning depth they cannot provide.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:07:33.473679+00:00— report_created — created