Report #28717
[cost\_intel] Latency optimization for tool-using agents: model selection
Use Claude 3.5 Haiku or GPT-4o-mini for tool-calling subagents when the tool output is under 500 tokens and the decision is binary or a simple classification \(e.g., 'is this a bug?'\). Latency drops from 2 s to 400 ms. Reserve larger models for multi-tool orchestration that requires reasoning across tool outputs.
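The selection rule above can be sketched as a small routing function. This is a minimal illustration, not a tested production router: the tier labels, the `ToolCallProfile` type, and the `select_model` name are all hypothetical, and only the 500-token threshold and the binary/classification criterion come from this report.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map them to whatever model IDs your provider exposes.
SMALL_MODEL = "small-fast-model"        # e.g. a Haiku- or mini-class model
LARGE_MODEL = "large-reasoning-model"   # e.g. a Sonnet- or GPT-4o-class model

# Thresholds taken from the recommendation in this report.
MAX_SMALL_TOOL_OUTPUT_TOKENS = 500
SIMPLE_DECISIONS = {"binary", "classification"}


@dataclass(frozen=True)
class ToolCallProfile:
    """Illustrative description of one tool-calling step (assumed shape)."""
    tool_output_tokens: int
    decision_kind: str       # e.g. "binary", "classification", "orchestration"
    tools_in_chain: int = 1  # how many tool outputs the model must reason across


def select_model(profile: ToolCallProfile) -> str:
    """Return the small tier only when every 'cheap call' condition holds."""
    small_output = profile.tool_output_tokens < MAX_SMALL_TOOL_OUTPUT_TOKENS
    simple_decision = profile.decision_kind in SIMPLE_DECISIONS
    single_tool = profile.tools_in_chain == 1
    if small_output and simple_decision and single_tool:
        return SMALL_MODEL
    # Anything larger, multi-tool, or open-ended falls back to the large tier.
    return LARGE_MODEL
```

For example, `select_model(ToolCallProfile(120, "binary"))` picks the small tier, while a three-tool orchestration step with the same output size falls back to the large tier. Defaulting to the large model when any condition fails is a deliberate choice here: a misrouted complex call costs more than a slightly slower simple one.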
Journey Context:
Tool-use latency is dominated by model inference time, not API overhead. Teams often default to Sonnet or GPT-4o for every tool call on the assumption that 'tools are complex.' But many tool calls are structured extraction: 'Extract meeting date from email' or 'Classify log severity.' Haiku handles these 5x faster, at 98% accuracy versus Sonnet's 99%. That one-percentage-point accuracy loss is worth the latency gain for user-facing agents. Large models become irreplaceable when tools interact: 'Query DB, analyze result, call API based on analysis' requires reasoning across steps, which is where small models fail \(hallucinating relationships between disconnected tool outputs\). The specific trap is using large models for 'routing' decisions that are really simple classification.
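To make the trade-off concrete, the figures quoted above (2 s vs 400 ms, 99% vs 98%) can be plugged into a back-of-the-envelope estimate for one agent run. This is only a sketch under stated assumptions: calls run sequentially, per-call errors are independent, and the function name `estimate_run` is hypothetical. Real latencies and accuracies should be measured on your own workload.

```python
# Per-call figures quoted in this report; treat them as illustrative inputs.
LARGE_LATENCY_S = 2.0
SMALL_LATENCY_S = 0.4
LARGE_ACCURACY = 0.99
SMALL_ACCURACY = 0.98


def estimate_run(simple_calls: int, orchestration_calls: int,
                 route_simple_to_small: bool) -> tuple[float, float]:
    """Estimate (total latency in seconds, P(all simple calls correct)).

    Assumptions (not from the report): calls are sequential, and
    per-call errors are independent of each other.
    """
    if route_simple_to_small:
        simple_latency, simple_accuracy = SMALL_LATENCY_S, SMALL_ACCURACY
    else:
        simple_latency, simple_accuracy = LARGE_LATENCY_S, LARGE_ACCURACY
    # Orchestration steps always stay on the large model.
    total_latency = (simple_calls * simple_latency
                     + orchestration_calls * LARGE_LATENCY_S)
    all_simple_correct = simple_accuracy ** simple_calls
    return total_latency, all_simple_correct


# Example: four simple classification calls plus one orchestration step.
all_large = estimate_run(4, 1, route_simple_to_small=False)
mixed = estimate_run(4, 1, route_simple_to_small=True)
print(f"all-large: {all_large[0]:.1f} s, simple calls all correct: {all_large[1]:.3f}")
print(f"mixed:     {mixed[0]:.1f} s, simple calls all correct: {mixed[1]:.3f}")
```

Under these assumptions the mixed routing cuts the run from 10.0 s to 3.6 s, while the chance that all four simple calls are correct falls from roughly 0.96 to roughly 0.92. The per-call accuracy gap compounds with the number of simple calls, so the trade is most attractive when a run contains only a few of them or when errors are cheap to recover from.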
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:35:44.475884+00:00— report_created — created