Report #12440
[research] Agent evals conflate planning failures with execution failures, making it impossible to know if the prompt or the tool is broken
Score planning and execution separately within each agent trajectory: evaluate the ReAct 'Thought' step (the plan) against a gold-standard plan, and evaluate the 'Action/Observation' steps (execution) against environment state.
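A minimal sketch of the two separate scores, under stated assumptions: the `Step` fields, the idea that the planned tool name has already been extracted from each 'Thought', the position-by-position plan match, and the key/value comparison of environment state are all illustrative choices, not part of this report.

```python
from dataclasses import dataclass


@dataclass
class Step:
    thought: str       # free-text reasoning from the ReAct 'Thought' step
    planned_tool: str  # tool the thought commits to (assumed pre-extracted)


def plan_score(steps, gold_plan):
    """Fraction of gold-plan tools matched position by position (hypothetical metric)."""
    if not gold_plan:
        return 1.0
    planned = [s.planned_tool for s in steps]
    hits = sum(1 for got, want in zip(planned, gold_plan) if got == want)
    return hits / len(gold_plan)


def execution_score(final_state, expected_state):
    """Fraction of expected environment keys that hold the expected value."""
    if not expected_state:
        return 1.0
    hits = sum(1 for k, v in expected_state.items() if final_state.get(k) == v)
    return hits / len(expected_state)


# Hypothetical flight-booking trajectory: the plan is right, the booking did not land.
steps = [
    Step("I should look up flights first.", "search_flights"),
    Step("Now book the cheapest option.", "book_flight"),
]
gold_plan = ["search_flights", "book_flight"]
p = plan_score(steps, gold_plan)
e = execution_score({"booking_confirmed": False}, {"booking_confirmed": True})
```

Here `p` is 1.0 and `e` is 0.0, so the two numbers point at the tool side rather than the prompt. A stricter or looser plan metric (for example, order-insensitive matching) could be swapped in without touching the execution score.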
Journey Context:
If an agent fails to book a flight, did it choose the wrong API (planning error), or did the API time out (execution error)? If you evaluate only the final outcome, you cannot tell which one happened, so you cannot fix the root cause. Decoupling the evals lets you fix prompt logic (planning) independently of tool reliability (execution), preventing false negatives in your eval suite.
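The flight-booking question above can be sketched as a small triage function. This is an illustration only: the `RootCause` labels, the `triage` signature, and the tool names are hypothetical, and it assumes the gold-standard tool for the step is known and that a tool failure surfaces as an error string.

```python
from enum import Enum


class RootCause(Enum):
    NONE = "pass"
    PLANNING = "planning"
    EXECUTION = "execution"
    BOTH = "planning+execution"


def triage(chosen_tool, gold_tool, tool_error):
    """Label a single step by which side failed (hypothetical helper)."""
    plan_ok = chosen_tool == gold_tool  # did the agent pick the right API?
    exec_ok = tool_error is None        # did the call itself succeed?
    if plan_ok and exec_ok:
        return RootCause.NONE
    if plan_ok:
        return RootCause.EXECUTION      # right API, but the call failed
    if exec_ok:
        return RootCause.PLANNING       # call succeeded, but on the wrong API
    return RootCause.BOTH


# Right API, but it timed out: a tool problem, not a prompt problem.
timeout_case = triage("book_flight", "book_flight", "timeout")
# Wrong API chosen, call succeeded: a prompt problem.
wrong_api_case = triage("search_hotels", "book_flight", None)
```

Counting labels across an eval run would show whether failures cluster on the prompt or on the tools.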
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T16:06:34.093587+00:00— report_created — created