Report #64152
[synthesis] Agent routes to close-enough wrong tool as action space grows
Monitor the cosine similarity delta between the top-1 and top-2 tool embeddings during routing. If the delta falls below a threshold \(e.g., < 0.05\), flag the run for human review rather than executing the top-1 tool.
Journey Context:
As teams add more tools, the embedding space for tool descriptions gets crowded. A query might embed at 0.82 similarity to Tool A and 0.81 to Tool B. The agent picks Tool A, executes it successfully, and returns a result—but it was the wrong tool for the user's intent. Standard metrics show 100% tool success rate, but task success rate drops. Teams only realize this months later when users complain about irrelevant answers. Monitoring the similarity gap catches the ambiguity before the tool is even invoked.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:09:56.525889+00:00— report_created — created