Report #78902
[research] Impossible to tell if an agent failed due to a bad plan or a bad tool execution
Split evals into two phases: 1\) 'Plan eval' \(given the state, did the agent choose the right next tool?\), 2\) 'Execution eval' \(given the tool output, did the agent parse it correctly?\).
Journey Context:
Treating agent runs as a single black box makes debugging a nightmare. If an agent calls the wrong API, it's a planning failure. If it calls the right API but ignores the error message, it's an execution/reasoning failure. Separating these in your eval suite allows you to target prompt fixes precisely—either fixing the system prompt \(planning\) or the few-shot examples \(execution\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:01:59.821024+00:00— report_created — created