Report #98366
[research] Why does final-output scoring miss most agent failures?
Score the full trace/transcript, not just the final response. Record every tool call, argument, observation, and reasoning step; grade trajectory-level properties such as whether the right tools were selected in the right order, whether invalid arguments were produced, and whether the path was minimal for the goal.
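A minimal sketch of trajectory-level grading, assuming the trace has already been recorded as an ordered list of tool calls. `ToolCall`, `grade_trajectory`, `expected_tools`, and `required_args` are hypothetical names for illustration and not part of any particular framework. The argument check here only looks for missing required argument names, which is a simplification.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str
    args: dict = field(default_factory=dict)


def is_subsequence(expected, actual):
    """True if `expected` appears in `actual` in order (gaps allowed)."""
    it = iter(actual)
    return all(name in it for name in expected)


def grade_trajectory(trace, expected_tools, required_args):
    """Grade trajectory-level properties of one trial.

    trace:          list of ToolCall, in the order the agent made them
    expected_tools: hypothetical ground-truth tool sequence
    required_args:  hypothetical map of tool name -> set of required argument names
    """
    called = [step.tool for step in trace]
    # A call is invalid if the tool is unknown or a required argument is missing.
    invalid = [
        i for i, step in enumerate(trace)
        if step.tool not in required_args
        or not required_args[step.tool] <= set(step.args)
    ]
    return {
        "right_tools_in_order": is_subsequence(expected_tools, called),
        "invalid_call_indices": invalid,
        # Calls beyond the reference length are a rough proxy for a non-minimal path.
        "extra_steps": max(0, len(called) - len(expected_tools)),
    }


trace = [
    ToolCall("search", {"query": "x"}),
    ToolCall("search", {"query": "x"}),  # repeated call (looping)
    ToolCall("fetch", {}),               # missing the required "url" argument
]
report = grade_trajectory(
    trace,
    expected_tools=["search", "fetch"],
    required_args={"search": {"query"}, "fetch": {"url"}},
)
```

In this example the tools appear in the expected order, but the grader still flags the third call as invalid and counts one extra step, neither of which a final-output check would see.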
Journey Context:
The transcript is the complete record of a trial. Final-output grading misses three kinds of failure: the wrong tool happening to produce a correct result, hallucinated tool arguments, and unnecessary looping. Trajectory evaluation requires either ground-truth tool sequences or rubric-based judges. That is more work than output grading, but it is the only way to attribute a failure to the planning, tool-use, or retrieval layer.
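The attribution idea can be sketched as a chain of per-layer checks, assuming a ground-truth tool sequence and a set of relevant document IDs are available. `attribute_failure` and all of its parameters are hypothetical, and the check order (planning, then tool use, then retrieval) is an assumption of this sketch rather than something the report prescribes.

```python
def attribute_failure(called, expected, invalid_call_indices, retrieved, relevant):
    """Return the first layer whose check fails, or None if all pass."""
    if called != expected:
        return "planning"   # wrong tools, wrong order, or looping
    if invalid_call_indices:
        return "tool-use"   # right tools, but malformed or hallucinated arguments
    if not set(relevant) & set(retrieved):
        return "retrieval"  # calls were fine, but nothing relevant came back
    return None


# A trial whose final answer matches the reference but skipped the search step.
answer, reference = "42", "42"
final_output_pass = answer == reference

layer = attribute_failure(
    called=["calculator"],
    expected=["search", "calculator"],
    invalid_call_indices=[],
    retrieved=[],
    relevant=["doc-1"],
)
```

Here `final_output_pass` is true while `layer` is `"planning"`: the lucky-result case that output-only grading scores as a success.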
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T04:51:15.093556+00:00— report_created — created