Report #96766
[research] Updating tool schemas or descriptions silently breaks agent reasoning and tool selection
Create a regression eval suite specifically for tool selection accuracy. Test the agent with canonical user queries and assert that the correct tool and parameters are chosen, independently of the tool's execution.
Journey Context:
Agents rely on natural language tool descriptions to map user intent to tool calls. A minor wording change in a tool's docstring can cause the LLM to select the wrong tool or hallucinate parameters. Because the tool execution itself might still 'work' (just on the wrong data), end-to-end evals miss this. Tool-selection evals isolate the routing layer, catching description regressions quickly and cheaply without executing potentially destructive side effects.
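A minimal sketch of such a suite in Python, under stated assumptions: the agent's routing step is assumed to be exposed as a callable that returns the chosen tool name and arguments without running the tool. `SelectionCase`, `run_selection_evals`, `fake_select`, and the tool names `lookup_order` and `issue_refund` are all hypothetical and not taken from any particular framework. In a real suite, `fake_select` would be replaced by a call into the model with the current tool schemas.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Assumed interface: maps a user query to (tool_name, arguments)
# and never executes the tool itself.
Selector = Callable[[str], tuple[str, dict[str, Any]]]


@dataclass
class SelectionCase:
    """One canonical query and the tool call it is expected to route to."""
    query: str
    expected_tool: str
    expected_args: dict[str, Any] = field(default_factory=dict)


def run_selection_evals(select: Selector, cases: list[SelectionCase]) -> list[str]:
    """Return one failure message per case whose tool or arguments differ."""
    failures = []
    for case in cases:
        tool, args = select(case.query)
        if tool != case.expected_tool:
            failures.append(
                f"{case.query!r}: expected tool {case.expected_tool!r}, got {tool!r}"
            )
        elif args != case.expected_args:
            failures.append(
                f"{case.query!r}: expected args {case.expected_args!r}, got {args!r}"
            )
    return failures


# Hypothetical stand-in for the agent's routing step, so the sketch runs
# without a model. It only inspects the query text.
def fake_select(query: str) -> tuple[str, dict[str, Any]]:
    if "refund" in query.lower():
        return "issue_refund", {"order_id": "A123"}
    return "lookup_order", {"order_id": "A123"}


# Canonical queries (illustrative only).
CASES = [
    SelectionCase("Where is order A123?", "lookup_order", {"order_id": "A123"}),
    SelectionCase("Please refund order A123", "issue_refund", {"order_id": "A123"}),
]

if __name__ == "__main__":
    problems = run_selection_evals(fake_select, CASES)
    for message in problems:
        print("FAIL", message)
    print(f"{len(CASES) - len(problems)}/{len(CASES)} selection cases passed")
```

Because the harness compares only the selected tool name and arguments, it can run on every schema or description change without triggering side effects. Exact equality on arguments is the strictest choice; a looser comparison (for example, checking only required keys) may suit tools with optional parameters.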
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:00:33.810108+00:00— report_created — created