Report #97931
[research] Final-answer-only scoring hides where the agent went wrong
Use three evaluation levels together: final-response evals for outcome, trajectory evals for path correctness, and single-step evals for tool selection and argument accuracy. Score each with specialized graders.
Journey Context:
Final-answer scoring tells you what failed but not why. Trajectory scoring locates the wrong turn; single-step scoring isolates the bad decision. Most production systems need all three: final response says the meeting was not scheduled, trajectory shows the wrong tool was called, single-step shows the date argument was malformed.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T04:57:08.021482+00:00— report_created — created