Report #78038
[research] Cannot tell if agent failed due to a bad plan or bad execution
Decouple plan evals from execution evals. Capture the agent's proposed plan (e.g., via a forced planning step) and evaluate it independently against the goal before execution proceeds. Score execution separately based on how faithfully it followed the valid plan.
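A minimal sketch of the two separate scores, assuming the plan and the execution trace are both available as ordered lists of step names. The names `evaluate_plan`, `execution_fidelity`, `PlanEval`, and the `required_steps` checklist are hypothetical, chosen for illustration; a real plan eval might use a rubric or an LLM judge instead of a checklist.

```python
from dataclasses import dataclass


@dataclass
class PlanEval:
    valid: bool
    missing_steps: list[str]


def evaluate_plan(plan: list[str], required_steps: list[str]) -> PlanEval:
    """Plan eval: check the proposed plan against the goal before any tool runs.

    Assumption: the goal is expressed as a checklist of required step names.
    """
    missing = [step for step in required_steps if step not in plan]
    return PlanEval(valid=not missing, missing_steps=missing)


def execution_fidelity(plan: list[str], executed: list[str]) -> float:
    """Execution eval: fraction of planned steps found in the trace, in plan order."""
    if not plan:
        return 0.0
    matched = 0
    pos = 0  # only look at trace entries after the last matched step
    for step in plan:
        try:
            idx = executed.index(step, pos)
        except ValueError:
            continue  # planned step never ran (or ran out of order)
        matched += 1
        pos = idx + 1
    return matched / len(plan)


plan = ["search_flights", "select_cheapest", "book"]
plan_result = evaluate_plan(plan, required_steps=["search_flights", "book"])
fidelity = execution_fidelity(plan, executed=["search_flights", "book"])
```

Here the plan passes its eval, while the execution score drops because `select_cheapest` never ran. Keeping the two numbers separate is the point: the plan score can be computed without running any tools.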
Journey Context:
When an agent fails, developers often assume the LLM 'isn't smart enough' and try to upgrade the model. But the plan may have been sound and a tool API simply changed, or the plan may have been flawed from the start. Unless your eval suite separates these cases, you are blind to the root cause. Plan evals can be fast and cheap; execution evals are slow and expensive.
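One way to turn the two signals into a root-cause label is sketched below. The function `diagnose`, its label strings, and the `threshold` parameter are hypothetical; it assumes you already have a plan verdict, an execution-faithfulness score between 0 and 1, and a task outcome for each run.

```python
def diagnose(
    plan_valid: bool,
    fidelity: float,
    task_succeeded: bool,
    threshold: float = 1.0,
) -> str:
    """Map separate plan and execution scores to a coarse failure label."""
    if task_succeeded:
        return "ok"
    if not plan_valid:
        # The plan was flawed from the start; execution quality is irrelevant.
        return "bad_plan"
    if fidelity < threshold:
        # The plan was fine but steps were skipped or failed,
        # for example because a tool API changed.
        return "bad_execution"
    # Valid plan, faithfully executed, still failed: the plan check itself
    # is probably too lenient, or the environment changed.
    return "check_plan_eval_or_environment"


labels = [
    diagnose(plan_valid=False, fidelity=1.0, task_succeeded=False),
    diagnose(plan_valid=True, fidelity=0.5, task_succeeded=False),
]
```

A failure labeled `bad_execution` points at tools and the environment rather than at the model, which is the case where upgrading the LLM is unlikely to help.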
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:34:52.179192+00:00— report_created — created