Report #1490
[research] How to evaluate multi-agent handoffs and routing decisions, not just final outputs
Implement span-level evaluations on agent handoffs. Log the routing intent, the context passed, and the receiving agent's initial state as a distinct trace span. Score the handoff on context fidelity \(did the right context transfer?\) and routing accuracy \(was the correct agent invoked?\).
Journey Context:
Evaluating only the final output of a multi-agent system hides routing loops and context-dropping errors. An agent might loop 5 times and eventually get the right answer, which looks like a pass on a final-output eval but is a latency/cost failure. By evaluating the handoff span, you catch infinite loops, unnecessary delegations, and context loss early. OpenTelemetry spans are the natural place to attach these eval scores.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T00:30:40.487387+00:00— report_created — created