Report #44815
[synthesis] Large tool sets (20+) cause model-specific hallucination patterns: phantom tools on Claude, wrong-tool on GPT-4o, wrong-params on Gemini
For tool sets over 15 tools: use distinct, non-overlapping name prefixes for Claude (prevents name blending); write highly differentiated tool descriptions for GPT-4o (prevents wrong-tool selection); add enum constraints and examples to parameter schemas for Gemini (prevents parameter hallucination). Consider implementing tool routing that dynamically subsets available tools based on the query.
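The routing suggestion above can be sketched roughly as follows. This is a minimal illustration, not a tested production approach: the tool catalog, the `route_tools` helper, and the word-overlap scoring are all assumptions made for the example, and a real router would more likely use embeddings or a classifier. The catalog also shows the prefix convention (`index_`, `billing_`) mentioned above.

```python
import re

# Hypothetical tool catalog; names and descriptions are illustrative only.
# Prefixes group related tools so that names do not overlap across groups.
TOOLS = [
    {"name": "index_create", "description": "Create a new search index with a given schema"},
    {"name": "index_search", "description": "Search documents in an existing index by query text"},
    {"name": "billing_refund", "description": "Issue a refund for a customer invoice"},
    {"name": "billing_invoice", "description": "Generate an invoice for a customer order"},
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def route_tools(query, tools, limit=2):
    """Return up to `limit` tools sharing the most words with the query.

    Tools with no overlapping words are dropped, so the model is only
    shown a small, relevant subset instead of the full catalog.
    """
    query_tokens = tokenize(query)
    scored = []
    for tool in tools:
        tool_tokens = tokenize(tool["name"] + " " + tool["description"])
        score = len(query_tokens & tool_tokens)
        if score > 0:
            scored.append((score, tool["name"], tool))
    # Highest score first; ties broken alphabetically by name for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [tool for _, _, tool in scored[:limit]]


subset = route_tools("refund the customer for this invoice", TOOLS)
print([tool["name"] for tool in subset])
```

Only the returned subset would then be passed to the model as its available tools. Plain word overlap also counts common words such as "for", so the scoring here is deliberately crude.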
Journey Context:
As tool count increases, each model exhibits a distinct failure signature. Claude occasionally generates a tool call for a nonexistent tool that is a semantic blend of two real tools (e.g., 'create_search_index' when both 'create_index' and 'search_index' exist). GPT-4o always calls an existing tool but increasingly selects the wrong one as tool count grows: two tools with similar descriptions get conflated. Gemini selects the correct tool more reliably but hallucinates parameter values, inventing values that aren't in the enum or providing strings where numbers are expected. These are three fundamentally different failure modes requiring three different mitigations. A single mitigation strategy (e.g., 'improve all descriptions') addresses only one failure mode. The cross-model insight is that tool schema design must be optimized against the union of all three failure modes, which no single provider's documentation addresses.
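Two of the three failure modes described above (phantom tools and bad parameter values) can be detected mechanically before a call is executed. The sketch below is one possible way to do that, under stated assumptions: the `REGISTRY` layout and the `check_tool_call` helper are hypothetical and not part of any provider's SDK. Wrong-tool selection cannot be caught this way, because the call is well-formed and only semantically wrong.

```python
# Hypothetical registry: tool name -> parameter spec. Illustrative only.
REGISTRY = {
    "create_index": {"name": {"type": str}, "shards": {"type": int}},
    "search_index": {
        "query": {"type": str},
        "sort": {"type": str, "enum": ["relevance", "date"]},
    },
}


def check_tool_call(name, arguments, registry):
    """Return a list of problems found in a proposed tool call (empty if none)."""
    # Phantom tool: the model produced a name that was never registered.
    if name not in registry:
        return [f"phantom tool: {name!r} is not registered"]

    problems = []
    spec = registry[name]
    for key, value in arguments.items():
        if key not in spec:
            problems.append(f"unknown parameter: {key!r}")
            continue
        expected = spec[key]["type"]
        # bool is a subclass of int in Python, so reject it explicitly for int params.
        type_ok = isinstance(value, expected) and not (
            expected is int and isinstance(value, bool)
        )
        if not type_ok:
            problems.append(f"wrong type for {key!r}: expected {expected.__name__}")
        elif "enum" in spec[key] and value not in spec[key]["enum"]:
            problems.append(f"value {value!r} for {key!r} is not in enum")
    return problems


print(check_tool_call("create_search_index", {}, REGISTRY))
print(check_tool_call("search_index", {"query": "logs", "sort": "popularity"}, REGISTRY))
print(check_tool_call("create_index", {"name": "docs", "shards": "3"}, REGISTRY))
```

A non-empty result could be returned to the model as an error message so it can retry, rather than failing silently.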
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:41:20.413604+00:00— report_created — created