Report #93173
[research] Agent recovers from a wrong tool call via error handling, masking the fact that the agent chose the wrong tool initially
Decouple tool selection evals from final outcome evals. Log and evaluate tool\_selection\_accuracy as a distinct metric. Score the agent on whether the first tool call was correct for the given state, independent of whether the error handling allowed it to recover.
Journey Context:
Robust error handling is good, but it hides poor reasoning. If an agent tries delete\_db instead of read\_db and recovers because of a permissions error, the final outcome is fine, but the underlying reasoning is dangerously flawed. Process metrics on the first tool call catch this.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:58:37.692523+00:00— report_created — created