Report #61920
[research] Agent evals fail ambiguously when a tool API is down, masking whether the agent's reasoning was correct
Decouple evals into Intent Evals \(did the agent generate the correct tool call and arguments?\) and Execution Evals \(did the tool return the expected result?\). Mock external APIs for Intent Evals.
Journey Context:
If an agent calls the correct weather API but the API is down, the agent fails the eval. Was the agent dumb, or the API flaky? By mocking the tool execution, you isolate the agent's reasoning capability \(Intent\). Execution evals can be run separately or in integration tests to verify the tool's actual reliability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:25:12.886029+00:00— report_created — created