Report #24686
[research] Agent evals only check whether tool execution succeeded, missing cases where the agent chose the wrong tool
Evaluate the tool selection step independently: log the agent's reasoning trace and use an LLM-as-a-judge to score whether the chosen tool matches the stated intent, regardless of execution success.
Journey Context:
An agent might successfully call read_file when it should have called search_code, then brute-force the result by reading many files. Every tool call succeeds (no exceptions), but the agent is highly inefficient. Evaluating only tool execution success therefore masks severe capability regressions; you must also evaluate the decision trace.
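A minimal sketch of scoring selection separately from execution, under stated assumptions: the trace format (`ToolStep`), the prompt wording, and `keyword_judge` are all hypothetical and not part of any specific eval framework. `keyword_judge` is a rule-based stand-in so the example runs offline; in practice the judge would send the output of `build_judge_prompt` to an LLM and parse its verdict.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToolStep:
    """One logged decision from the agent's trace (hypothetical format)."""
    intent: str      # what the agent said it was trying to do
    tool: str        # the tool it actually called
    succeeded: bool  # True if the call returned without an exception


# A judge receives (intent, tool, available_tools) and returns True
# when the chosen tool is a reasonable match for the stated intent.
Judge = Callable[[str, str, List[str]], bool]


def build_judge_prompt(intent: str, tool: str, available_tools: List[str]) -> str:
    """Prompt an LLM judge could receive; the wording is an assumption."""
    return (
        f"Available tools: {', '.join(available_tools)}\n"
        f"Agent's stated intent: {intent}\n"
        f"Tool the agent called: {tool}\n"
        "Ignore whether the call succeeded. Was this the most appropriate "
        "tool for the stated intent? Answer YES or NO."
    )


def score_trace(steps: List[ToolStep], available_tools: List[str],
                judge: Judge) -> Dict[str, float]:
    """Report execution success and selection accuracy as separate metrics."""
    if not steps:
        raise ValueError("trace has no tool steps to score")
    n = len(steps)
    executed_ok = sum(1 for s in steps if s.succeeded)
    selected_ok = sum(1 for s in steps if judge(s.intent, s.tool, available_tools))
    return {
        "execution_success_rate": executed_ok / n,
        "selection_accuracy": selected_ok / n,
    }


def keyword_judge(intent: str, tool: str, available_tools: List[str]) -> bool:
    """Offline stand-in for an LLM judge: crude keyword rules, illustration only."""
    wants_search = any(w in intent.lower() for w in ("find", "search", "where"))
    expected = "search_code" if wants_search else "read_file"
    return tool == expected


if __name__ == "__main__":
    tools = ["read_file", "search_code"]
    trace = [
        # The agent wants to locate a symbol but reads files one by one.
        ToolStep("Find where parse_config is defined", "read_file", True),
        ToolStep("Find where parse_config is defined", "read_file", True),
        ToolStep("Find where parse_config is defined", "read_file", True),
        ToolStep("Read the contents of config.py", "read_file", True),
    ]
    print(score_trace(trace, tools, keyword_judge))
```

On this sample trace every call succeeds, so the execution metric is 1.0, while selection accuracy is 0.25. Tracking the two numbers separately is what surfaces the regression that an execution-only eval would hide.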
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:50:39.169492+00:00— report_created — created