Report #74981
[research] Minor updates to tool descriptions or schemas cause the agent to silently stop using the tool correctly
Create a regression eval suite specifically for tool selection. Inject canonical user queries and assert that the agent's planned tool calls match the expected tools, bypassing execution.
Journey Context:
LLMs are highly sensitive to tool descriptions. Changing a single word can shift the agent's preference. Full end-to-end evals are too slow to run on every prompt/tool doc change. Tool-selection-only evals are fast, deterministic, and catch routing regressions instantly before execution.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:27:13.915917+00:00— report_created — created