Report #8992
[research] Agent calls the wrong tool but still manages to complete the task via luck or workaround
Evaluate tool selection independently of task completion: build a tool-choice eval dataset in which each input is a user request and the expected output is the exact tool name and argument schema, and check the agent's proposed call against it before anything executes.
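A minimal sketch of such a dataset and its checker, under stated assumptions: "schema" is read here as the exact argument dictionary, and every name (`ToolCall`, `ToolChoiceCase`, `check_tool_choice`, `run_tool_choice_eval`, `naive_select_tool`) is hypothetical and not taken from any particular agent framework. The selector is a stand-in for whatever the agent uses to propose a call.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolCall:
    """A proposed tool invocation: tool name plus arguments (hypothetical shape)."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


@dataclass
class ToolChoiceCase:
    """One eval row: a user request and the call the agent should propose."""
    request: str
    expected: ToolCall


def check_tool_choice(case: ToolChoiceCase, proposed: ToolCall) -> list[str]:
    """Compare a proposed call with the expected one. Nothing is executed."""
    errors = []
    if proposed.name != case.expected.name:
        errors.append(f"tool: expected {case.expected.name!r}, got {proposed.name!r}")
    if proposed.arguments != case.expected.arguments:
        errors.append(
            f"arguments: expected {case.expected.arguments!r}, got {proposed.arguments!r}"
        )
    return errors


def run_tool_choice_eval(
    cases: list[ToolChoiceCase],
    select_tool: Callable[[str], ToolCall],
) -> dict[str, list[str]]:
    """Return the failing requests mapped to their mismatch messages."""
    failures = {}
    for case in cases:
        errors = check_tool_choice(case, select_tool(case.request))
        if errors:
            failures[case.request] = errors
    return failures


# Illustrative dataset; the requests and user IDs are made up.
CASES = [
    ToolChoiceCase(
        "Temporarily disable the account for user 42",
        ToolCall("deactivate_user", {"user_id": 42}),
    ),
    ToolChoiceCase(
        "Permanently erase user 42 and all their data",
        ToolCall("delete_user", {"user_id": 42}),
    ),
]


def naive_select_tool(request: str) -> ToolCall:
    """Stand-in for the agent's routing step: always picks delete_user."""
    return ToolCall("delete_user", {"user_id": 42})


failures = run_tool_choice_eval(CASES, naive_select_tool)
for request, errors in failures.items():
    print(request, "->", errors)
```

Because the check runs on the proposed call alone, it needs no backend or mock, and a mismatch can block execution rather than being discovered afterwards.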
Journey Context:
If you only evaluate end-to-end success, an agent can use the wrong tool (e.g., delete_user instead of deactivate_user) and the eval still passes because the test environment is mocked or forgiving. Isolating tool selection as a discrete eval step checks the agent's routing logic directly, which helps prevent catastrophic tool misuse in production, where no mocks exist.
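The failure mode can be illustrated with a small sketch. Everything here is an assumption made for illustration: `ForgivingMockBackend`, `propose_call`, and `execute` are hypothetical, and the mock deliberately treats deletion and deactivation as the same state change, which is what lets the end-to-end check pass.

```python
class ForgivingMockBackend:
    """Hypothetical test double in which both tools only flip an 'active' flag."""

    def __init__(self) -> None:
        self.active = {42: True}

    def deactivate_user(self, user_id: int) -> None:
        self.active[user_id] = False

    def delete_user(self, user_id: int) -> None:
        # A real backend would destroy the record; the mock only marks it inactive.
        self.active[user_id] = False


def propose_call(request: str) -> tuple[str, dict]:
    """Stand-in for an agent whose routing is wrong: it always proposes delete_user."""
    return ("delete_user", {"user_id": 42})


def execute(backend: ForgivingMockBackend, call: tuple[str, dict]) -> None:
    """Dispatch the proposed call to the backend method of the same name."""
    name, arguments = call
    getattr(backend, name)(**arguments)


request = "Temporarily disable the account for user 42"
expected_call = ("deactivate_user", {"user_id": 42})

# Tool-choice eval: compare the proposal with the expectation before executing.
proposed_call = propose_call(request)
routing_ok = proposed_call == expected_call

# End-to-end eval: execute against the mock and inspect only the final state.
backend = ForgivingMockBackend()
execute(backend, proposed_call)
e2e_ok = backend.active[42] is False

print(f"end-to-end passed: {e2e_ok}, tool choice passed: {routing_ok}")
```

In this sketch the end-to-end check passes while the tool-choice check fails, so only the second eval exposes the misrouting.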
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T07:06:34.477323+00:00— report_created — created