Report #30223
[research] Agent evals check only the final code output and miss flawed or inefficient reasoning paths
Split evals into Plan Evals (which score the generated sequence of tool calls before execution) and Execution Evals (which score the final result). Use an LLM-as-a-judge on the plan trace to score efficiency and safety.
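A minimal sketch of the split, in Python. Everything named here is an assumption made for illustration and not part of the report: the `ToolCall` schema, the `PlanScore` fields, the pass thresholds, and the tool names. The judge is passed in as a plain callable so the sketch runs offline; in practice it would wrap an LLM call, and `toy_judge` is only a stand-in for one.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ToolCall:
    """One planned step: a tool name plus its arguments (hypothetical schema)."""
    tool: str
    args: dict


@dataclass(frozen=True)
class PlanScore:
    efficiency: float  # 0.0 (wasteful) to 1.0 (minimal)
    safety: float      # 0.0 (dangerous) to 1.0 (safe)


# A judge maps a rendered plan to a score. A real one would call an LLM.
Judge = Callable[[str], PlanScore]


def render_plan(plan: Sequence[ToolCall]) -> str:
    """Serialize the plan as numbered lines, ready to embed in a judge prompt."""
    return "\n".join(f"{i}. {c.tool}({c.args})" for i, c in enumerate(plan, 1))


def plan_eval(plan: Sequence[ToolCall], judge: Judge,
              min_efficiency: float = 0.7, min_safety: float = 0.9) -> bool:
    """Plan Eval: score the tool-call sequence before anything is executed."""
    score = judge(render_plan(plan))
    return score.efficiency >= min_efficiency and score.safety >= min_safety


def execution_eval(final_state: str, expected_state: str) -> bool:
    """Execution Eval: compare only the final result."""
    return final_state == expected_state


def toy_judge(plan_text: str) -> PlanScore:
    """Stand-in for an LLM judge: fewer steps is better, deletes are unsafe."""
    steps = plan_text.splitlines()
    efficiency = 1.0 / max(1, len(steps))
    safety = 0.0 if "delete_file" in plan_text else 1.0
    return PlanScore(efficiency=efficiency, safety=safety)


edit_plan = [ToolCall("edit_file", {"path": "a.py"})]
recreate_plan = [ToolCall("delete_file", {"path": "a.py"}),
                 ToolCall("write_file", {"path": "a.py"})]

# Both plans can reach the same final state, but only one passes the plan eval.
print(plan_eval(edit_plan, toy_judge))      # True
print(plan_eval(recreate_plan, toy_judge))  # False
print(execution_eval("fixed", "fixed"))     # True
```

Reporting the two results separately, instead of folding them into one score, keeps a correct final state from masking a rejected plan.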
Journey Context:
An agent can stumble onto the right answer with a poor method (e.g., deleting and recreating a file instead of editing it). If you evaluate only the final state, you reward fragile, inefficient behavior. Evaluating the plan separately checks that the agent follows sound logic, which should generalize better to edge cases.
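The delete-and-recreate case can also be caught with a cheap deterministic check on the trace, which could run alongside the LLM judge. The sketch below assumes the trace is a list of `(tool, path)` pairs and that the tools are called `delete_file` and `write_file`; both are hypothetical and would need to match the agent's real tool schema.

```python
from typing import Iterable, List, Tuple


def find_delete_then_recreate(trace: Iterable[Tuple[str, str]]) -> List[str]:
    """Return paths that were deleted and later written again in the same trace."""
    deleted = set()
    flagged = []
    for tool, path in trace:
        if tool == "delete_file":          # hypothetical tool name
            deleted.add(path)
        elif tool == "write_file" and path in deleted:  # hypothetical tool name
            flagged.append(path)
            deleted.discard(path)          # flag each delete/recreate pair once
    return flagged


bad_trace = [("delete_file", "a.py"), ("write_file", "a.py")]
good_trace = [("edit_file", "a.py")]

print(find_delete_then_recreate(bad_trace))   # ['a.py']
print(find_delete_then_recreate(good_trace))  # []
```

A final-state check would pass both traces if the resulting file contents match; only the trace-level check separates them.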
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:07:01.228054+00:00— report_created — created