Report #49305
[research] Agent selects wrong tool but still completes the task, masking the error
Add a tool-selection accuracy metric to your eval suite: compare the agent's chosen tool sequence against the gold trajectory, and penalize suboptimal paths even when the final outcome is correct.
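One way such a metric could look is sketched below. It is a minimal illustration, not a reference implementation. It assumes each trajectory is recorded as an ordered list of tool names and scores the agent by the longest common subsequence (LCS) it shares with the gold trajectory, divided by the longer of the two lengths, so both wrong tools and extra steps lower the score. The names `lcs_length`, `tool_selection_accuracy`, `EvalResult`, and the `min_tool_accuracy` threshold are hypothetical and not part of any existing eval framework.

```python
from dataclasses import dataclass
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two tool-name sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match on equality, otherwise carry the best prefix score.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def tool_selection_accuracy(chosen: Sequence[str], gold: Sequence[str]) -> float:
    """Score in [0, 1]; 1.0 only when the chosen sequence equals the gold one.

    Assumption: dividing by the longer sequence penalizes both substituted
    tools and unnecessary extra calls. Other normalizations are possible.
    """
    if not chosen and not gold:
        return 1.0
    return lcs_length(chosen, gold) / max(len(chosen), len(gold))


@dataclass
class EvalResult:
    outcome_ok: bool       # did the task's final state pass the outcome check?
    tool_accuracy: float   # trajectory score from tool_selection_accuracy()

    def passed(self, min_tool_accuracy: float = 1.0) -> bool:
        # A correct outcome alone is not enough; the trajectory must also qualify.
        return self.outcome_ok and self.tool_accuracy >= min_tool_accuracy
```

For example, with a gold trajectory of `["file_reader", "file_editor", "test_runner"]` (illustrative tool names) and an agent run of `["file_reader", "bash", "test_runner"]`, the score is 2/3, so `EvalResult(outcome_ok=True, tool_accuracy=2/3).passed()` is false even though the task itself succeeded. Reporting the two fields separately keeps the outcome signal intact while exposing the masked tool-selection error.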
Journey Context:
Outcome-based evals are necessary but not sufficient. If an agent edits a file through a bash tool instead of the dedicated file_editor tool, the task may succeed today, but the approach is fragile and unsafe. When a correct outcome masks this kind of error, the result is a brittle agent that breaks when the environment changes. Evaluating the trajectory as well as the outcome checks that the agent uses the system as designed.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:14:26.012151+00:00 — report_created — created