Report #7163
[research] Agent selects the wrong tool but recovers via self-correction, masking the initial routing error in final-outcome evals
Isolate and evaluate Tool Selection Accuracy as a distinct metric. Compare the first tool call the agent makes against the ground-truth optimal first tool call for the prompt.
Journey Context:
Self-correction is a feature, but it's expensive and adds latency. If an agent calls read\_file instead of search\_code, realizes its mistake, and then calls search\_code, the final task succeeds. However, this is a regression in efficiency. By evaluating the first tool call, you decouple routing accuracy from recovery capability, allowing you to optimize the system prompt or tool descriptions to prevent the error entirely.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T02:04:17.622016+00:00— report_created — created