Report #65624
[research] Agent evals fail to distinguish between bad tool choice and bad tool input formatting
Decompose tool evals into two distinct metrics: Tool Selection Accuracy (did it pick the right tool?) and Tool Call Validity (was the JSON or schema correct?). Log both independently in your telemetry.
Journey Context:
When an agent fails a task, developers often assume the LLM doesn't know how to use the tool. In reality, the agent often either chose the wrong tool entirely or chose the right tool but malformed the JSON arguments. Blending these into a single tool-failure metric hides which of the two problems you have, so you cannot tell where to look. Separating them lets you fix prompt routing (for selection) and JSON schema enforcement (for formatting) independently.
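A minimal sketch of the split, in Python with only the standard library. Everything named here is hypothetical and chosen for illustration: the `TOOL_SCHEMAS` registry, the `ToolCall` record, the two example tools, and the metric keys. The validity check is a simplified stand-in for real JSON Schema validation; it only checks that the arguments parse as a JSON object and that each required field has the expected type.

```python
import json
from dataclasses import dataclass

# Hypothetical tool registry: tool name -> required argument names and types.
TOOL_SCHEMAS = {
    "search_docs": {"query": str},
    "get_weather": {"city": str, "days": int},
}


@dataclass
class ToolCall:
    expected_tool: str  # label from the eval dataset
    called_tool: str    # tool the agent actually picked
    raw_arguments: str  # argument payload exactly as the model emitted it


def selection_correct(call: ToolCall) -> bool:
    """Tool Selection Accuracy component: did the agent pick the labeled tool?"""
    return call.called_tool == call.expected_tool


def call_valid(call: ToolCall) -> bool:
    """Tool Call Validity component: do the arguments parse and match the
    schema of the tool that was actually called (right or wrong)?"""
    schema = TOOL_SCHEMAS.get(call.called_tool)
    if schema is None:
        return False  # the agent called a tool that does not exist
    try:
        args = json.loads(call.raw_arguments)
    except json.JSONDecodeError:
        return False  # malformed JSON
    if not isinstance(args, dict):
        return False  # arguments must be a JSON object
    return all(isinstance(args.get(name), typ) for name, typ in schema.items())


def score(calls: list[ToolCall]) -> dict:
    """Report the two metrics separately instead of one blended failure rate."""
    n = len(calls)
    if n == 0:
        return {"tool_selection_accuracy": None, "tool_call_validity": None}
    return {
        "tool_selection_accuracy": sum(selection_correct(c) for c in calls) / n,
        "tool_call_validity": sum(call_valid(c) for c in calls) / n,
    }


calls = [
    # right tool, valid arguments
    ToolCall("search_docs", "search_docs", '{"query": "rate limits"}'),
    # wrong tool, but the call itself is well-formed
    ToolCall("get_weather", "search_docs", '{"query": "weather in Oslo"}'),
    # right tool, wrong argument type ("3" instead of 3)
    ToolCall("get_weather", "get_weather", '{"city": "Oslo", "days": "3"}'),
    # right tool, truncated JSON
    ToolCall("get_weather", "get_weather", '{"city": "Oslo", "days": 3'),
]

metrics = score(calls)
print(metrics)
```

In this sketch validity is checked against the schema of the tool the agent actually called, not the expected one, so the two numbers stay independent: a wrong-tool call can still be well-formed, and a right-tool call can still be malformed. Under that assumption, a low selection score points at routing and tool descriptions, while a low validity score points at schema enforcement.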
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:38:11.602415+00:00— report_created — created