Report #7759
[research] Agent fails a task, but it is unclear if the LLM chose the wrong tool or passed the wrong arguments
Decouple tool selection evals from tool execution evals, and score each one separately. First eval: did the agent choose the correct tool name for the intent? Second eval: given the correct tool, did the agent populate its arguments correctly based on the context?
Journey Context:
Treating tool use as a single monolithic action obscures the root cause of failures. An agent might correctly identify it needs to search a database but format the SQL incorrectly, or vice versa. By splitting the eval, you can determine whether the failure is in the planning/reasoning layer (tool selection) or the grounding/formatting layer (argument extraction).
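A minimal sketch of the split eval in Python, to make the two layers concrete. The names here (`ToolCall`, `SplitEvalResult`, `evaluate_tool_call`) are hypothetical and not taken from any eval framework, and the sketch assumes each test case has exactly one expected tool call whose arguments can be compared by exact equality.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolCall:
    """Hypothetical record of one tool call: the tool name plus its arguments."""
    name: str
    arguments: dict[str, Any]


@dataclass
class SplitEvalResult:
    selection_correct: bool
    # None when the wrong tool was chosen: its arguments cannot be
    # meaningfully compared against the expected tool's arguments.
    arguments_correct: Optional[bool]
    mismatched_args: list[str]

    @property
    def failure_layer(self) -> Optional[str]:
        """Name the layer that failed, or None if both evals passed."""
        if not self.selection_correct:
            return "tool_selection"
        if not self.arguments_correct:
            return "argument_extraction"
        return None


_MISSING = object()  # sentinel so a missing key never equals a real value


def evaluate_tool_call(expected: ToolCall, actual: ToolCall) -> SplitEvalResult:
    # First eval: did the agent choose the correct tool name?
    if actual.name != expected.name:
        return SplitEvalResult(False, None, [])

    # Second eval: did the agent populate the arguments correctly?
    # Assumption: exact equality per argument, checked over the union of keys
    # so that both missing and unexpected arguments count as mismatches.
    keys = sorted(set(expected.arguments) | set(actual.arguments))
    mismatched = [
        key
        for key in keys
        if expected.arguments.get(key, _MISSING) != actual.arguments.get(key, _MISSING)
    ]
    return SplitEvalResult(True, not mismatched, mismatched)
```

With an expected call `ToolCall("search_database", {"query": "SELECT id FROM users"})`, an actual call to the same tool with a different `query` string reports `failure_layer == "argument_extraction"`, while an actual call to any other tool reports `"tool_selection"`. Exact equality is a deliberate simplification: free-form arguments such as SQL would likely need a normalized or semantic comparison instead.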
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T03:40:27.904333+00:00— report_created — created