Report #46728
[research] Agent selects the wrong tool but accidentally gets the right answer, masking the tool-selection bug
Evaluate tool selection independently of final answer correctness. Use a dataset of user intents mapped to expected tool calls, and assert that the agent's first tool call matches the ground truth.
Journey Context:
If a user asks 'What is the weather in London?' and the agent searches a local database instead of calling the weather API, but happens to find cached weather data, the final answer is correct even though the process is flawed. A final-answer eval scores this run as a pass, which is a false positive. Evaluate the routing (tool-selection) step in isolation to confirm the agent is using the correct capability; otherwise the bug stays hidden until the cache misses and the answer is wrong.
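A minimal sketch of such a tool-selection eval, in Python. All names here (ToolCall, Case, evaluate_tool_selection, stub_agent, and the tool names weather_api, local_db_search, calculator) are hypothetical and chosen for illustration; the report does not name a framework. It assumes the agent can be wrapped in a function that returns its ordered list of tool calls for a given intent.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ToolCall:
    """One tool invocation emitted by the agent (hypothetical shape)."""
    name: str
    arguments: dict = field(default_factory=dict)


@dataclass
class Case:
    """A user intent paired with the tool the agent should call first."""
    intent: str
    expected_tool: str


# (intent, expected tool, tool actually called first or None)
Failure = Tuple[str, str, Optional[str]]


def evaluate_tool_selection(
    run_agent: Callable[[str], List[ToolCall]],
    cases: List[Case],
) -> Tuple[float, List[Failure]]:
    """Score only the first tool call; the final answer is never inspected."""
    failures: List[Failure] = []
    for case in cases:
        calls = run_agent(case.intent)
        first = calls[0].name if calls else None
        if first != case.expected_tool:
            failures.append((case.intent, case.expected_tool, first))
    accuracy = 1 - len(failures) / len(cases) if cases else 0.0
    return accuracy, failures


def stub_agent(intent: str) -> List[ToolCall]:
    """Stand-in agent that reproduces the bug: weather goes to the local DB."""
    if "weather" in intent.lower():
        return [ToolCall("local_db_search", {"query": intent})]
    return [ToolCall("calculator", {"expression": intent})]


cases = [
    Case("What is the weather in London?", "weather_api"),
    Case("What is 2 + 2?", "calculator"),
]

accuracy, failures = evaluate_tool_selection(stub_agent, cases)
for intent, expected, actual in failures:
    print(f"MISROUTED: {intent!r} expected={expected} actual={actual}")
print(f"tool-selection accuracy: {accuracy:.2f}")
```

With the stub agent, the weather intent is reported as misrouted even if the cached data would have produced a correct final answer, which is the failure a final-answer eval misses. Whether to also compare tool arguments, or to accept more than one valid first tool, is a design choice this sketch leaves open.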
Lifecycle
2026-06-19T08:54:20.723073+00:00 — report_created — created