Report #83389
[research] Agent evals fail, but it is unclear whether the plan was bad or the execution failed
Decouple evals into Plan Evals and Execution Evals. For Plan Evals, mock the tool outputs so they always return perfect data, then evaluate whether the agent chooses the correct sequence of tool calls. For Execution Evals, provide a gold-standard plan and evaluate whether the agent can work through tool failures and still achieve the goal.
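A minimal sketch of a Plan Eval harness, under stated assumptions: the tool names (search_flights, book_flight, send_email), the gold sequence, and the agent_step(history) interface are all hypothetical and stand in for whatever agent and tools are actually under test. The mocked tools never fail, so a mismatch between the called sequence and the gold sequence points at the plan.

```python
from typing import Callable, Dict, List, Optional, Tuple

Action = Tuple[str, dict]  # (tool name, arguments)

# Hypothetical mocked tools: each always succeeds and returns canned "perfect" data.
MOCK_TOOLS: Dict[str, Callable[[dict], dict]] = {
    "search_flights": lambda args: {"flights": [{"id": "F1", "price": 120}]},
    "book_flight": lambda args: {"confirmation": "ABC123"},
    "send_email": lambda args: {"sent": True},
}

# Hypothetical gold-standard tool-call order for this task.
GOLD_SEQUENCE: List[str] = ["search_flights", "book_flight", "send_email"]


def run_plan_eval(
    agent_step: Callable[[list], Optional[Action]],
    gold_sequence: List[str],
    max_steps: int = 10,
) -> dict:
    """Drive the agent against mocked tools and compare its call order to the gold sequence."""
    history: list = []
    called: List[str] = []
    for _ in range(max_steps):
        action = agent_step(history)
        if action is None:  # the agent signals that it is done
            break
        name, args = action
        called.append(name)
        tool = MOCK_TOOLS.get(name)
        result = tool(args) if tool else {"error": "unknown tool"}
        history.append((name, args, result))
    return {"called": called, "passed": called == gold_sequence}


def make_scripted_agent(order: List[str]) -> Callable[[list], Optional[Action]]:
    """Stand-in for a real LLM agent: replays a fixed tool order, one call per step."""
    def agent_step(history: list) -> Optional[Action]:
        if len(history) >= len(order):
            return None
        return order[len(history)], {}
    return agent_step


good = run_plan_eval(make_scripted_agent(GOLD_SEQUENCE), GOLD_SEQUENCE)
bad = run_plan_eval(make_scripted_agent(["book_flight", "send_email"]), GOLD_SEQUENCE)
```

Exact sequence equality is the strictest possible scoring rule and is used here only for brevity. A real harness may need to accept several valid orderings or score partial matches.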
Journey Context:
End-to-end evals conflate two distinct failure modes. An agent might write a brilliant plan but fail because an API is down, or it might write a terrible plan but get lucky with a forgiving API. Mocking tools in plan evals removes tool behavior as a variable, so a failure points to the LLM's reasoning. Providing a gold plan in execution evals removes planning as a variable, so a failure points to the agent's resilience and tool handling.
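A minimal sketch of an Execution Eval, again under stated assumptions: FlakyTool, the executor functions, the book_flight tool, and the goal check are hypothetical illustrations, not part of any specific framework. The plan is fixed to the gold plan, and failures are injected into the tool, so the only thing being measured is whether the executor recovers.

```python
from typing import Callable, Dict, List, Tuple

Plan = List[Tuple[str, dict]]  # ordered (tool name, arguments) steps


class FlakyTool:
    """Wraps a tool and raises TimeoutError on the first `failures` calls."""

    def __init__(self, fn: Callable[[dict], dict], failures: int) -> None:
        self.fn = fn
        self.remaining_failures = failures

    def __call__(self, args: dict) -> dict:
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise TimeoutError("injected failure")
        return self.fn(args)


def run_execution_eval(executor, gold_plan: Plan, tools: Dict[str, Callable], goal_check) -> dict:
    """Hand the executor a known-good plan; pass only if the goal state is reached."""
    try:
        state = executor(gold_plan, tools)
    except Exception as exc:  # an unhandled tool failure counts as an execution failure
        return {"passed": False, "error": repr(exc)}
    return {"passed": bool(goal_check(state)), "error": None}


def naive_executor(plan: Plan, tools: Dict[str, Callable]) -> dict:
    """Runs each step once with no error handling."""
    return {name: tools[name](args) for name, args in plan}


def retrying_executor(plan: Plan, tools: Dict[str, Callable], max_attempts: int = 3) -> dict:
    """Retries each step on TimeoutError, up to max_attempts times."""
    state: dict = {}
    for name, args in plan:
        for _ in range(max_attempts):
            try:
                state[name] = tools[name](args)
                break
            except TimeoutError:
                continue
        else:
            raise RuntimeError(f"step {name!r} failed after {max_attempts} attempts")
    return state


def make_tools() -> Dict[str, Callable]:
    """Fresh flaky tools per run, so the injected failure counts do not leak between evals."""
    return {"book_flight": FlakyTool(lambda args: {"confirmation": "ABC123"}, failures=2)}


GOLD_PLAN: Plan = [("book_flight", {"flight_id": "F1"})]


def goal_reached(state: dict) -> bool:
    return state.get("book_flight", {}).get("confirmation") == "ABC123"


naive_result = run_execution_eval(naive_executor, GOLD_PLAN, make_tools(), goal_reached)
retry_result = run_execution_eval(retrying_executor, GOLD_PLAN, make_tools(), goal_reached)
```

Because both executors receive the same gold plan, the difference between the two results can only come from how each one handles the injected timeouts.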
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:33:24.928822+00:00— report_created — created