Report #92778
[research] Telemetry only tracks tool execution success, missing when the agent selects the wrong tool
Track and log a 'tool selection accuracy' metric: compare the tool the agent invoked against a golden set of expected tools for the given user intent, and record the result separately from the tool's HTTP status code or execution result.
Journey Context:
Standard observability easily captures tool latency and errors (e.g., 500s). But an agent that calls search_documents when it should call query_database still gets a 200 back while producing the wrong answer. Because nothing in the execution telemetry flags it, this silent failure is the primary cause of bad agent outputs. Instrumenting the intent-to-tool mapping requires labeled data, which is expensive, but without it you are blind to the most common agent failure mode.
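
A minimal sketch of how the metric could be computed from logged calls, assuming each call has already been labeled with an intent. The `ToolCall` record, the `GOLDEN` mapping, the intent labels, and `selection_report` are all hypothetical names for illustration and do not come from any particular telemetry library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    intent: str  # hypothetical intent label assigned during data labeling
    tool: str    # tool the agent actually invoked
    status: int  # HTTP status returned by the tool execution


# Hypothetical golden set: intent label -> tools considered correct for it.
GOLDEN = {
    "lookup_order_total": {"query_database"},
    "find_policy_text": {"search_documents"},
}


def selection_report(calls, golden):
    """Score tool selection independently of execution status."""
    # Only calls whose intent appears in the golden set can be scored.
    labeled = [c for c in calls if c.intent in golden]
    correct = [c for c in labeled if c.tool in golden[c.intent]]
    # Silent failures: wrong tool chosen, yet the execution reported success.
    silent = [
        c for c in labeled
        if c.tool not in golden[c.intent] and 200 <= c.status < 300
    ]
    accuracy = len(correct) / len(labeled) if labeled else None
    return {
        "labeled": len(labeled),
        "accuracy": accuracy,
        "silent_failures": silent,
    }


calls = [
    ToolCall("lookup_order_total", "search_documents", 200),  # wrong tool, 200
    ToolCall("lookup_order_total", "query_database", 200),    # right tool, 200
    ToolCall("find_policy_text", "search_documents", 500),    # right tool, 500
    ToolCall("unlabeled_intent", "search_documents", 200),    # not scorable
]

report = selection_report(calls, GOLDEN)
print(report["accuracy"], len(report["silent_failures"]))
```

In this sample, two of the three labeled calls chose an expected tool, so the selection accuracy is 2/3. The one silent failure is the first call, which returned 200. The third call counts as a correct selection even though it returned 500, which is the point of keeping the two signals separate.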
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:18:55.665824+00:00— report_created — created