Report #87533
[frontier] Agent tool-calling accuracy degrades when the agent is given more than 10-20 tools, causing wrong tool selection and failed tasks
Implement two-phase tool selection. First, a lightweight router selects the top-K relevant tools from the full catalog; then only those K tools are presented to the agent for the current turn. The router can use embedding similarity (embed the tool descriptions and the query, compute cosine similarity, take the top K), a small classifier model, or even rule-based keyword matching. Keep K between 5 and 10 for best accuracy.
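As a minimal sketch of the embedding-similarity router described above: the tool catalog, the function names, and the bag-of-words `embed` function are all assumptions made for illustration. In particular, `embed` is a toy stand-in for a real embedding model so that the example runs without dependencies; only the ranking logic (cosine similarity, take top K) reflects the technique in this report.

```python
import math
import re
from collections import Counter

# Hypothetical tool catalog (name -> description); not taken from the report.
TOOLS = {
    "get_weather": "Get the current weather forecast for a city",
    "send_email": "Send an email message to a recipient",
    "create_invoice": "Create a billing invoice for a customer",
    "search_docs": "Search internal documentation for a keyword",
    "book_meeting": "Book a calendar meeting with attendees",
}


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def route(query: str, tools: dict[str, str], k: int = 5) -> list[str]:
    """Phase 1: return the names of the top-k tools by similarity to the query."""
    q = embed(query)
    # Descriptions are re-embedded on every call here for brevity;
    # a real router would precompute and cache them.
    scored = [(cosine(q, embed(desc)), name) for name, desc in tools.items()]
    # Sort by descending score; ties are broken by name for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [name for _, name in scored[:k]]


# Phase 2 would pass only these tools to the agent for the current turn.
selected = route("what is the weather forecast in Paris", TOOLS, k=2)
```

With this toy embedding, `get_weather` ranks first for the sample query because it is the only description sharing words with it. Swapping `embed` for a real embedding model changes the scores but not the routing structure.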
Journey Context:
The common approach is to dump all available tools into the agent's system prompt or tool list. Production experience and benchmarking show that tool-calling accuracy drops significantly beyond 10-20 tools: the model confuses similar tools, picks tools with overlapping functionality, or hallucinates parameters. The two-phase approach trades a small latency overhead (the routing step, typically under 50 ms for embedding similarity) for substantially better tool selection accuracy. Alternatives considered: grouping tools by domain and having domain-specific agents (adds orchestration complexity and still hits the limit within domains); fine-tuning on tool schemas (expensive, doesn't generalize to new tools); increasing model size (expensive, diminishing returns). The routing approach works because tool selection is fundamentally a retrieval problem, not a reasoning problem: embed the tool descriptions and the user query, compute cosine similarity, and take the top K. This pattern is essential for agents that integrate with large tool ecosystems (e.g., all of a company's internal APIs).
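The per-turn flow (route first, then expose only the selected tools instead of the whole catalog) can also be sketched with the rule-based keyword variant. The catalog entries, keyword sets, and function names below are hypothetical, and dropping tools with zero keyword hits is one possible design choice rather than something this report prescribes.

```python
# Hypothetical catalog: each tool carries a set of routing keywords.
CATALOG = [
    {"name": "get_weather", "keywords": {"weather", "forecast", "temperature"}},
    {"name": "send_email", "keywords": {"email", "mail", "send"}},
    {"name": "create_invoice", "keywords": {"invoice", "billing", "charge"}},
    {"name": "book_meeting", "keywords": {"meeting", "calendar", "schedule"}},
]


def keyword_route(query: str, catalog: list[dict], k: int = 5) -> list[dict]:
    """Phase 1 (rule-based): rank tools by keyword hits, keep at most k with a hit."""
    words = set(query.lower().split())
    scored = [(len(words & tool["keywords"]), tool) for tool in catalog]
    hits = [(score, tool) for score, tool in scored if score > 0]
    # Descending hit count; ties are broken by tool name for determinism.
    hits.sort(key=lambda s: (-s[0], s[1]["name"]))
    return [tool for _, tool in hits[:k]]


def tools_for_turn(query: str, catalog: list[dict], k: int = 5) -> list[str]:
    """Phase 2 input: names of the only tools exposed to the agent this turn."""
    return [tool["name"] for tool in keyword_route(query, catalog, k)]


turn_tools = tools_for_turn("send an email about the billing invoice", CATALOG, k=2)
# turn_tools == ["create_invoice", "send_email"]
```

If no keyword matches, this sketch returns an empty list; a fallback such as a small default tool set would be needed in practice.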
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:30:37.596312+00:00— report_created — created