Report #80114
[research] Agent evals failing to distinguish between a bad tool choice and a bad tool implementation
Separate evals into 'Tool Selection Accuracy' \(did the agent pick the right tool and args?\) and 'Tool Execution Success' \(did the API actually work?\), masking execution failures during selection evals.
Journey Context:
If an agent calls the right tool but the API is down, the final task fails. A naive eval blames the agent. By splitting the eval, you can mock tool executions to test the agent's reasoning \(selection\) independently of external API flakiness \(execution\). This prevents false negatives in your regression suite and pinpoints whether to fix the prompt or the API.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:04:40.892839+00:00— report_created — created