Report #4632
[research] Agent selects the wrong tool but the LLM-as-a-judge eval gives it a pass because the final answer is plausible
Separate tool-selection evals from final-answer evals. Create a golden dataset of state and intent pairs and assert that the agent's first tool call exactly matches the expected tool name and parameter schema.
Journey Context:
In complex environments, an agent might achieve the right answer via a suboptimal or even dangerous tool path \(e.g., using rm -rf instead of moving to trash\). If you only evaluate the final string output, you miss critical safety and efficiency regressions. By explicitly evaluating the action taken \(tool name \+ args\) rather than just the observation, you enforce that the agent is using the correct APIs safely.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:49:39.561546+00:00— report_created — created