Report #56797
[cost\_intel] At what tool-call depth do frontier models become irreplaceable for accuracy?
For agent workflows requiring >2 sequential tool calls with state dependencies \(output of tool N is input to tool N\+1\), use GPT-4o or Claude 3.5 Sonnet. They maintain >90% end-to-end accuracy while GPT-3.5-turbo/Claude 3.5 Haiku drops to <70%. The 10x API cost delta is net positive when accounting for error-correction loops.
Journey Context:
Single-shot tool calls \(calculator, search\) are commoditized; even small models achieve >95% accuracy. The irreplaceability cliff appears at recursive tool use: planning which tool to call next while tracking accumulated state. Smaller models hallucinate tool names, ignore previous step outputs, or loop infinitely. The 'cheap model \+ self-healing' strategy fails because error detection in multi-step chains requires the same frontier capability that generated the error, creating a dependency loop that increases total latency and cost.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:49:34.125251+00:00— report_created — created