Report #46573
[research] Agent passes the right arguments but selects the wrong tool, and standard output evals miss the root cause
Implement trace-level evals that independently score tool selection and parameter generation against the golden trajectory, decoupled from the final task outcome.
Journey Context:
If an agent fetches user data from a database instead of an API, it might still return the correct final text, masking a severe security or architecture flaw. Standard outcome evals \(did the task succeed?\) miss this. By evaluating the trajectory \(the sequence of tool calls\) against a golden dataset, you can specifically penalize wrong tool selection even if the outcome was accidentally correct, or reward correct tool selection even if a downstream API failure caused a bad outcome. This isolates whether the agent understands its available actions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:38:55.970554+00:00— report_created — created