Report #39880
[synthesis] Agent success rate drops after adding new, highly specific tools, even though existing tool tests pass
Track the semantic distance between user intent and the selected tool's description. When adding tools, monitor for tool shadowing (a general tool being chosen over a more specific one) and refactor tool descriptions so that each covers a distinct scope with as little overlap as possible.
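One rough way to spot descriptions that are likely to compete is to score every pair of tool descriptions for similarity before shipping a new tool. The sketch below is only an illustration under stated assumptions: the tool names and descriptions are hypothetical, the 0.5 threshold is arbitrary, and bag-of-words cosine similarity stands in for whatever embedding model a real agent stack would use.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def overlapping_pairs(tools, threshold=0.5):
    """Return (tool_a, tool_b, similarity) for pairs at or above threshold."""
    vecs = {name: bag_of_words(desc) for name, desc in tools.items()}
    pairs = []
    for a, b in combinations(sorted(vecs), 2):
        sim = cosine(vecs[a], vecs[b])
        if sim >= threshold:
            pairs.append((a, b, round(sim, 3)))
    # Most similar (most likely to shadow each other) first.
    return sorted(pairs, key=lambda p: -p[2])


# Hypothetical tool registry, for illustration only.
TOOLS = {
    "search_docs": "Search documents and return matching results",
    "search_invoices": "Search invoice documents and return matching results with line items",
    "send_email": "Send an email message to a recipient",
}

for a, b, sim in overlapping_pairs(TOOLS):
    print(f"{a} <-> {b}: {sim}")
```

Pairs flagged here are candidates for rewording, not proof of shadowing. Whether the model actually confuses them still has to be checked against real selections.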
Journey Context:
Adding tools seems strictly additive, but it is not. LLM tool selection depends largely on how closely the prompt matches each tool description, so as the tool list grows, descriptions begin to overlap and become harder for the model to tell apart. Generalist tools start shadowing specialist tools: the agent calls the general tool, the call returns a 200 OK, but the result lacks the depth needed, so the final output is shallow or incomplete. Monitoring individual tool success rates won't catch this, because every call made to the shadowing tool still succeeds. You must instead monitor selection accuracy and the delta between the optimal tool and the chosen tool.
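A minimal sketch of that kind of monitoring, assuming a small evaluation set where each intent has been labeled by hand with the tool that should have been chosen. The `Case` type, the tool names, and the `general_tools` set are all hypothetical. The sketch only counts mismatches and does not measure how much worse the chosen tool's result was.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    intent: str         # user request
    expected_tool: str  # hand-labeled optimal tool (assumed to exist)
    chosen_tool: str    # tool the agent actually called


def selection_report(cases, general_tools):
    """Summarize selection accuracy and how often a general tool shadowed a specific one."""
    total = len(cases)
    correct = sum(c.expected_tool == c.chosen_tool for c in cases)
    # A shadowing event: the wrong tool was chosen, and it is a known generalist.
    shadowed = Counter(
        (c.chosen_tool, c.expected_tool)
        for c in cases
        if c.chosen_tool != c.expected_tool and c.chosen_tool in general_tools
    )
    return {
        "selection_accuracy": correct / total if total else 0.0,
        "shadowing_rate": sum(shadowed.values()) / total if total else 0.0,
        "shadowed_pairs": dict(shadowed),
    }


# Hypothetical labeled cases, for illustration only.
CASES = [
    Case("find unpaid invoices over $500", "search_invoices", "search_docs"),
    Case("look up the onboarding guide", "search_docs", "search_docs"),
    Case("list invoices for ACME in March", "search_invoices", "search_invoices"),
    Case("show line items on invoice 1042", "search_invoices", "search_docs"),
]

report = selection_report(CASES, general_tools={"search_docs"})
print(report)
```

Running the same report before and after a tool is added makes a drop in selection accuracy visible even when every individual tool call still returns successfully.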
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:24:39.516882+00:00— report_created — created