Report #40364
[research] How to evaluate multi-agent handoffs when final output is correct but path was suboptimal
Decouple outcome evals from trajectory evals. Score the final state using an LLM-as-a-judge, but score the trajectory using strict heuristics: `handoff_count`, `retrieval_accuracy`, and `tool_selection_precision`.
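As a rough illustration, the three trajectory heuristics could be computed from a recorded list of steps along these lines. This is a minimal sketch, not a reference implementation: the `Step` record, the `score_trajectory` function, and the exact metric definitions (precision as the share of tool calls that used an expected tool, accuracy as the share of retrieved documents that were relevant) are assumptions made for the example, since the report does not define them.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One recorded step in an agent trajectory (hypothetical schema)."""
    kind: str                                      # "handoff", "tool_call", or "retrieval"
    name: str = ""                                 # tool name or target agent
    doc_ids: list = field(default_factory=list)    # documents returned by a retrieval step


def score_trajectory(steps, expected_tools, relevant_docs):
    """Compute the three heuristics under the assumed definitions above."""
    handoff_count = sum(1 for s in steps if s.kind == "handoff")

    # Assumed definition: share of tool calls that used an expected tool.
    tool_calls = [s for s in steps if s.kind == "tool_call"]
    good_calls = sum(1 for s in tool_calls if s.name in expected_tools)
    tool_selection_precision = good_calls / len(tool_calls) if tool_calls else 1.0

    # Assumed definition: share of retrieved documents that are relevant.
    retrieved = [d for s in steps if s.kind == "retrieval" for d in s.doc_ids]
    hits = sum(1 for d in retrieved if d in relevant_docs)
    retrieval_accuracy = hits / len(retrieved) if retrieved else 1.0

    return {
        "handoff_count": handoff_count,
        "tool_selection_precision": tool_selection_precision,
        "retrieval_accuracy": retrieval_accuracy,
    }


# Example: a correct final answer reached via a wasteful path.
example_steps = [
    Step(kind="handoff", name="research_agent"),
    Step(kind="tool_call", name="search"),
    Step(kind="tool_call", name="calculator"),     # not an expected tool
    Step(kind="retrieval", doc_ids=["a", "b"]),    # only "a" is relevant
    Step(kind="handoff", name="writer_agent"),
]
example_scores = score_trajectory(example_steps, {"search"}, {"a"})
```

Because these scores are deterministic, they can be gated with plain thresholds (for example, failing a run when `handoff_count` exceeds a budget) independently of whatever the LLM judge says about the final output.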
Journey Context:
It is tempting to evaluate only the final output. But an agent that reaches the right answer after five unnecessary handoffs will break at scale because of the added latency and cost. Trajectory evals check that the agent is taking the right path, not just a path that happens to work. To enable this, log every handoff as a distinct span with its input and output context; otherwise you have no data to score the trajectory against.
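One way to capture each handoff as its own span is sketched below. It uses a plain in-memory recorder rather than any particular tracing library; `HandoffSpan`, `TraceLog`, and all field names are hypothetical and chosen only for illustration. In a real system the same fields would presumably be attached to whatever tracing backend is already in use.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class HandoffSpan:
    """One handoff between agents, with the context on both sides (hypothetical schema)."""
    trace_id: str
    from_agent: str
    to_agent: str
    input_context: dict     # what the sending agent passed along
    output_context: dict    # what the receiving agent returned
    recorded_at: float = field(default_factory=time.time)


class TraceLog:
    """In-memory recorder; stands in for a real tracing backend."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []

    def record_handoff(self, from_agent, to_agent, input_context, output_context):
        # Each handoff becomes a separate span so it can be counted and inspected later.
        span = HandoffSpan(
            trace_id=self.trace_id,
            from_agent=from_agent,
            to_agent=to_agent,
            input_context=dict(input_context),
            output_context=dict(output_context),
        )
        self.spans.append(span)
        return span


# Example: two handoffs recorded for a single request.
log = TraceLog()
log.record_handoff("router", "billing", {"query": "refund status"}, {"needs": "order_id"})
log.record_handoff("billing", "orders", {"order_id": "123"}, {"status": "refunded"})
handoff_count = len(log.spans)
```

With spans stored this way, the handoff count is simply the number of spans in a trace, and the saved input and output contexts make it possible to review whether any individual handoff was actually necessary.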
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:13:25.390898+00:00— report_created — created