Report #84006
[research] Agent chooses the wrong tool but still completes the task via a convoluted workaround, masking the routing failure
Add a trace-level eval for tool selection accuracy: compare the agent's chosen tool against a ground-truth expected tool for each intent, and score this independently of the final task outcome.
Journey Context:
If an agent is asked to search the database but instead reads a file and parses it manually, the final answer might be correct, but the path was inefficient, costly, and brittle. Evaluating only the final result (outcome eval) misses this routing failure. Process evals (evaluating the trace/steps) are critical for agents: they verify that the agent uses the provided APIs correctly and efficiently, rather than hacking its way to an answer.
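A minimal sketch of such a trace-level check, written in Python. Everything named here is hypothetical and not taken from any specific agent framework: the `TraceCase` record shape, the tool names (`search_database`, `read_file`, `parse_csv`), and the assumption that the first tool call in a trace represents the routing decision. The point it illustrates is that tool selection accuracy and task success are computed as separate numbers, so a successful run with the wrong tool shows up as a masked routing failure.

```python
from dataclasses import dataclass


@dataclass
class TraceCase:
    """One evaluated run (hypothetical record shape)."""
    intent: str               # label for the user's intent, e.g. "db_lookup"
    expected_tool: str        # ground-truth tool for that intent
    called_tools: list[str]   # tools the agent actually called, in order
    task_succeeded: bool      # result of the outcome eval, scored separately


def tool_selection_correct(case: TraceCase) -> bool:
    # Assumption: the first tool call is the routing decision.
    # An empty trace counts as a routing miss.
    return bool(case.called_tools) and case.called_tools[0] == case.expected_tool


def score(cases: list[TraceCase]) -> dict:
    if not cases:
        raise ValueError("no cases to score")
    n = len(cases)
    routing_hits = sum(tool_selection_correct(c) for c in cases)
    outcome_hits = sum(c.task_succeeded for c in cases)
    # Runs that reached a correct answer through the wrong tool.
    masked = [c for c in cases if c.task_succeeded and not tool_selection_correct(c)]
    return {
        "tool_selection_accuracy": routing_hits / n,
        "task_success_rate": outcome_hits / n,
        "masked_routing_failures": len(masked),
    }


# Illustrative data only; not results from a real evaluation.
EXAMPLE_CASES = [
    TraceCase("db_lookup", "search_database", ["search_database"], True),
    TraceCase("db_lookup", "search_database", ["read_file", "parse_csv"], True),
    TraceCase("file_summary", "read_file", ["read_file"], True),
    TraceCase("db_lookup", "search_database", ["read_file"], False),
]

if __name__ == "__main__":
    print(score(EXAMPLE_CASES))
```

On the illustrative data, the second case succeeds at the task while routing to the wrong tool, which is exactly the situation an outcome-only eval would miss. Whether the first call, any call, or the full call sequence should be compared against the expected tool is a design choice that depends on the agent; the first-call rule here is only one option.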
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:35:40.470735+00:00— report_created — created