Report #64121
[research] Agent silently fails by selecting wrong tool or hallucinating tool parameters
Add a dedicated 'Tool Selection Accuracy' eval and telemetry span attribute. Log the `tool_name` and `tool_args` the agent chose, and compare against the ground-truth expected tool. Track the ratio of successful tool calls vs. tool errors (e.g., TypeError, ValueError, 404) in your observability dashboard to catch silent drift before it impacts the final output.
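A minimal sketch of how the eval could be computed from logged tool calls. The `ToolCallRecord` schema, the function names, and the sample tool names (`search_docs`, `get_weather`, `fetch_url`) are assumptions made for illustration; they do not come from any particular agent framework.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolCallRecord:
    """One logged tool call (hypothetical schema for illustration)."""
    tool_name: str               # tool the agent chose
    tool_args: dict[str, Any]    # arguments the agent supplied
    expected_tool: str           # ground-truth tool from the eval set
    error: Optional[str] = None  # e.g. "TypeError" or "404"; None on success


def tool_selection_accuracy(records):
    """Fraction of calls where the chosen tool matches the expected tool."""
    if not records:
        return 0.0
    correct = sum(r.tool_name == r.expected_tool for r in records)
    return correct / len(records)


def tool_error_rate(records):
    """Fraction of calls that ended in a tool error."""
    if not records:
        return 0.0
    failed = sum(r.error is not None for r in records)
    return failed / len(records)


# Hypothetical sample log: one wrong tool, two tool errors.
records = [
    ToolCallRecord("search_docs", {"query": "refund policy"}, "search_docs"),
    ToolCallRecord("get_weather", {"city": "Paris"}, "search_docs"),
    ToolCallRecord("search_docs", {"q": 42}, "search_docs", error="TypeError"),
    ToolCallRecord("fetch_url", {"url": "https://example.com/missing"},
                   "fetch_url", error="404"),
]

print(f"selection accuracy: {tool_selection_accuracy(records):.2f}")
print(f"tool error rate: {tool_error_rate(records):.2f}")
```

Keeping the two metrics separate is deliberate: the third record picks the right tool but passes bad arguments, so it counts toward accuracy and toward the error rate at the same time.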
Journey Context:
Final-output evals (like LLM-as-a-judge) mask intermediate failures. An agent might recover from a bad tool call by guessing, or fail entirely but look 'close' to an LLM judge. By evaluating the intermediate step, tool selection, you catch degradations in the model's reasoning earlier. Tracking tool error rates in APM tools gives real-time production alerts on drift.
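One way the production alerting could work is a rolling-window error rate checked against a threshold. The sketch below is an assumption about how such a check might be structured; `ToolErrorMonitor`, its window size, and its threshold are hypothetical, and a real deployment would more likely express this as an alert rule in the APM tool itself.

```python
from collections import deque


class ToolErrorMonitor:
    """Rolling-window tool error rate with a simple alert threshold (illustrative)."""

    def __init__(self, window_size=100, threshold=0.2):
        # Only the most recent `window_size` outcomes are kept.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, succeeded):
        """Store True for a successful tool call, False for a tool error."""
        self.outcomes.append(bool(succeeded))

    def error_rate(self):
        """Share of failed calls in the current window."""
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_alert(self):
        """Alert only once the window is full, to avoid noise on startup."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.error_rate() > self.threshold


# Hypothetical stream of outcomes: two failures in a window of five.
monitor = ToolErrorMonitor(window_size=5, threshold=0.2)
for ok in [True, True, False, True, False]:
    monitor.record(ok)

print(monitor.error_rate(), monitor.should_alert())
```

Waiting for a full window before alerting trades a short detection delay for fewer false alarms right after a deploy.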
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:06:41.904126+00:00— report_created — created