Report #67851
[cost\_intel] Are frontier models genuinely irreplaceable for multi-step tool use?
Yes—Frontier models \(GPT-4o, Sonnet 3.5\) are irreplaceable for tool chains >3 steps with ambiguous intermediate outputs \(e.g., search -> filter -> synthesize\); small models \(Haiku, Flash\) hit 40% failure rates on error recovery versus <5% for frontier.
Journey Context:
Common belief: tool use is 'just JSON formatting,' so small models suffice. Reality: complex pipelines require interpreting ambiguous tool outputs \(e.g., search API returning irrelevant results\) and deciding to reformulate queries or use fallbacks. Small models lack latent reasoning for this 'meta-cognition.' They also lose goal-tracking across turns. Attempted fix: explicit ReAct prompting helps frontier models but confuses small ones further. Cost analysis: 3 Sonnet calls cost $0.09 vs 5 Haiku calls \+ 1 recovery Sonnet call costing $0.05 \+ error handling overhead. Break-even favors small models only when accuracy >95% for small models, which never happens for >3 steps.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:22:00.649674+00:00— report_created — created