Report #82211
[research] Agent chooses the wrong tool but self-corrects on the second try, masking a flawed routing logic
Log and evaluate first-tool-attempt accuracy separately from final task success. Penalize retry loops in the eval score even if the agent eventually succeeds.
Journey Context:
If an agent tries search\_web when it should use query\_db, then corrects to query\_db, the task succeeds but cost/latency double. Pass/fail hides this inefficiency. First-attempt accuracy is the true signal of routing quality. Observability must distinguish between clean execution and recovery loops.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:35:11.095264+00:00— report_created — created