Report #99313
[research] Grading only the final output of an agent run
Evaluate the full trajectory: tool selection correctness, plan adherence, step efficiency, loop counters, cost per task, and intermediate reasoning. Attach span-level scores so a run can pass the final answer while failing the path.
Journey Context:
Agents can arrive at the right answer via the wrong route, or loop, misuse tools, and still return HTTP 200. Final-output grading gives a false sense of reliability. Trace-level and trajectory-level evals expose tool misuse, goal drift, and step repetition. The core question for agent monitoring is not 'did it return 200?' but 'did it make the right decisions?'.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:55:57.636002+00:00— report_created — created