Report #68974
[research] Agent evals fail to distinguish between a bad LLM decision and a bad tool implementation
Separate evals into two phases: 1) decision evals, which mock all tool outputs to test whether the LLM selects the right tools, and 2) execution evals, which run the tools live to test the actual API integrations.
Journey Context:
When an agent fails, the cause is either a reasoning failure (the LLM chose the wrong tool) or an integration failure (the API was down or its output was parsed incorrectly). If you test both together, flaky APIs cause spurious failures (false negatives) in your reasoning evals, because a failed run no longer tells you which layer broke. Mocking tools isolates the LLM's logic, while live runs validate the environment.
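A minimal sketch of the two-phase split, under stated assumptions. Every name here (`toy_policy`, `run_agent`, `decision_eval`, `execution_eval`, the tool names) is hypothetical and not taken from any framework. `toy_policy` is a keyword-matching stand-in for the LLM, and the "live" tools are local stand-ins so the example runs offline. One of them raises to simulate a broken integration. The execution phase calls tools directly with fixed arguments, which is one possible reading of "run tools live".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

ToolFn = Callable[[str], str]


@dataclass
class EvalResult:
    phase: str   # "decision" or "execution"
    passed: bool
    detail: str


def toy_policy(task: str) -> str:
    """Hypothetical stand-in for the LLM: returns the name of the tool to call."""
    return "get_weather" if "weather" in task.lower() else "search_docs"


def run_agent(task: str, tools: Dict[str, ToolFn]) -> str:
    """Hypothetical one-step agent loop: pick a tool, call it, return its output."""
    return tools[toy_policy(task)](task)


def decision_eval(cases: List[Tuple[str, str]], tool_names: List[str]) -> List[EvalResult]:
    """Phase 1: every tool is a mock, so only the tool choice is under test."""
    calls: List[str] = []

    def make_mock(name: str) -> ToolFn:
        def mock(arg: str) -> str:
            calls.append(name)  # record which tool the agent picked
            return f"canned output from {name}"
        return mock

    mocks = {name: make_mock(name) for name in tool_names}
    results = []
    for task, expected in cases:
        calls.clear()
        run_agent(task, mocks)
        results.append(EvalResult("decision", calls == [expected],
                                  f"{task!r}: called {calls}, expected {expected}"))
    return results


def execution_eval(tools: Dict[str, ToolFn],
                   probes: List[Tuple[str, str, Callable[[str], bool]]]) -> List[EvalResult]:
    """Phase 2: call each real tool with a fixed argument; no LLM is involved."""
    results = []
    for name, arg, check in probes:
        try:
            passed, detail = check(tools[name](arg)), f"{name} returned"
        except Exception as exc:  # outage, timeout, parse error, ...
            passed, detail = False, f"{name} raised {exc!r}"
        results.append(EvalResult("execution", passed, detail))
    return results


def broken_weather_api(city: str) -> str:
    """Stand-in for a live integration that is currently down."""
    raise ConnectionError("weather API unreachable")


def working_docs_search(query: str) -> str:
    """Stand-in for a live integration that works."""
    return f"doc: {query}"


decision_results = decision_eval(
    [("What's the weather in Oslo?", "get_weather"),
     ("Find the retry docs", "search_docs")],
    ["get_weather", "search_docs"],
)
execution_results = execution_eval(
    {"get_weather": broken_weather_api, "search_docs": working_docs_search},
    [("get_weather", "Oslo", lambda out: bool(out)),
     ("search_docs", "retry", lambda out: out.startswith("doc:"))],
)
```

In this sketch the broken weather stand-in fails only its execution probe, while both decision cases still pass, so the failure is attributed to the integration rather than to tool selection.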
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:15:25.724455+00:00— report_created — created