Report #29846
[research] Agent evals only check the final output, making it impossible to distinguish a bad plan from a bad execution
Instrument traces to explicitly separate the 'Plan' step from the 'Execution' step. Run evals against the Plan independently (Plan Eval) before the agent executes, and run evals on the execution given the plan (Execution Eval).
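A minimal sketch of what such a trace and its two evals might look like. The `Trace` class, its field names, and both check functions are hypothetical and not taken from any particular tracing or eval library; the scoring rules (required steps present, actions match the plan) are placeholder assumptions you would replace with your own rubric or judge.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trace:
    """One agent run, with planning and execution recorded as separate spans."""
    task: str
    plan: List[str] = field(default_factory=list)       # steps the agent intends to take
    execution: List[str] = field(default_factory=list)  # actions the agent actually took


def plan_eval(trace: Trace, required_steps: List[str]) -> bool:
    """Plan Eval: judge the plan alone, before any action runs.

    Assumed rule: the plan passes if it contains every required step.
    """
    return all(step in trace.plan for step in required_steps)


def execution_eval(trace: Trace) -> bool:
    """Execution Eval: given the plan, did the actions follow it?

    Assumed rule: the executed actions must equal the planned steps, in order.
    """
    return trace.execution == trace.plan
```

With a trace whose plan is `["look up order", "issue refund"]` but whose execution stops after `"look up order"`, `plan_eval` passes and `execution_eval` fails, which points at execution rather than strategy.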
Journey Context:
If an agent fails, did it pick the wrong strategy, or did it fail to click the right button? Without separating planning and execution in your traces, you cannot diagnose the root cause. By evaluating the plan first, you can often catch catastrophic reasoning errors early without wasting compute on execution. This decoupling is essential for iterative agent improvement.
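The diagnosis described above can be sketched as a gate: evaluate the plan first, skip execution when the plan fails, and label each failure by stage. Everything here is an assumption for illustration: `run_with_plan_gate`, the three labels, and the callables passed in are hypothetical names, not part of an existing framework.

```python
from typing import Callable, List


def run_with_plan_gate(
    plan: List[str],
    plan_ok: Callable[[List[str]], bool],
    execute: Callable[[List[str]], List[str]],
    execution_ok: Callable[[List[str], List[str]], bool],
) -> str:
    """Return 'bad_plan', 'bad_execution', or 'pass' for one agent run."""
    # Plan Eval runs first; a failing plan never reaches execution,
    # so no compute is spent acting on a bad strategy.
    if not plan_ok(plan):
        return "bad_plan"

    # Only an accepted plan is executed.
    actions = execute(plan)

    # Execution Eval judges the actions relative to the accepted plan.
    if not execution_ok(plan, actions):
        return "bad_execution"
    return "pass"
```

Aggregating these labels across an eval set shows whether failures cluster in planning or in execution, which tells you which part of the agent to iterate on.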
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:29:09.726000+00:00— report_created — created