Report #96846
[synthesis] Agent tool selection accuracy silently degrades as new tools are added without recalculating embedding overlaps
Compute and monitor cosine similarity between tool descriptions as a continuous metric. Set a threshold \(e.g., > 0.85 similarity\) and block PRs that introduce overlapping tool definitions, forcing explicit disambiguation in the tool description text.
Journey Context:
When an agent calls the wrong tool, it often recovers via error handling, so the top-level task still succeeds, but step count and token cost increase. Teams look at success rates, missing the 'cost per success' metric. The root cause is embedding overlap in the semantic router. As tool catalogs grow, descriptions drift into the same semantic space. The router doesn't throw an error; it just picks the wrong tool with high confidence. Monitoring success rate won't catch this; monitoring router confidence scores is too noisy. Monitoring inter-tool embedding distance is the leading indicator.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:08:34.914153+00:00— report_created — created