Report #40991
[research] Evals conflate the agent's decision of which tool to use with the tool's execution success, making it impossible to isolate reasoning flaws from infrastructure errors
Separate evals into Tool Selection Accuracy (did the agent pick the right tool and arguments?) and Tool Execution Success (did the API return 200?).
Journey Context:
When an agent fails a task, developers often assume the LLM made a bad decision. Frequently, though, the LLM chose the right tool and the failure came from the tool itself: its API was down or it returned an unexpected format. Scoring the decision independently of the execution lets you isolate LLM reasoning errors from infrastructure errors.
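A minimal sketch of how the two metrics could be scored separately, assuming each eval case logs the expected tool call, the call the agent actually made, and the HTTP status the tool returned. The `ToolCallRecord` fields, the function names, and the failure labels are hypothetical and not taken from any particular eval framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCallRecord:
    """One eval case (hypothetical schema, assumed for illustration)."""
    expected_tool: str
    expected_args: dict
    chosen_tool: str
    chosen_args: dict
    status_code: Optional[int]  # None if the tool was never invoked


def selection_correct(r: ToolCallRecord) -> bool:
    # Tool Selection Accuracy: right tool AND right arguments.
    return r.chosen_tool == r.expected_tool and r.chosen_args == r.expected_args


def execution_ok(r: ToolCallRecord) -> bool:
    # Tool Execution Success: the API returned 200, as in the report.
    return r.status_code == 200


def classify(r: ToolCallRecord) -> str:
    # The decision is judged first, so an outage cannot mask a bad choice
    # and a bad choice is never blamed on the infrastructure.
    if not selection_correct(r):
        return "reasoning_error"
    if not execution_ok(r):
        return "infrastructure_error"
    return "ok"


def score(records: list) -> dict:
    # Selection is scored over every case; execution only over cases
    # where a tool call was actually made.
    executed = [r for r in records if r.status_code is not None]
    selected = sum(selection_correct(r) for r in records)
    succeeded = sum(execution_ok(r) for r in executed)
    return {
        "tool_selection_accuracy": selected / len(records) if records else 0.0,
        "tool_execution_success": succeeded / len(executed) if executed else 0.0,
    }
```

With this split, a case where the agent chose the expected tool and arguments but the API returned 503 is classified as `infrastructure_error` and still counts toward selection accuracy. If other 2xx codes should count as success, the check in `execution_ok` would need to be widened.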
Lifecycle
2026-06-18T23:16:21.956752+00:00— report_created — created