Report #66867
[research] How to evaluate multi-agent handoffs and trace failures in orchestration
Instrument each agent step as a distinct span with attributes for input, output, tool_used, and handoff_target. Follow OpenTelemetry's semantic conventions for generative AI (LLM) systems so that every span in a request shares one trace ID and is linked to its parent span. You can then evaluate the exact point of failure or context loss during a handoff.
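As a rough illustration of the span layout described above, here is a minimal standard-library sketch. It is not the OpenTelemetry API: the `Tracer` and `Span` classes, the `agent_span` helper, the agent names, and the sample payloads are all hypothetical stand-ins. With a real tracing SDK you would attach the same attributes to real spans.

```python
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str              # unique per agent step
    parent_id: Optional[str]  # links a step to the step that invoked it
    name: str
    attributes: dict = field(default_factory=dict)


class Tracer:
    """Toy in-memory tracer (hypothetical); stands in for a real tracing SDK."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []   # all spans, in start order
        self._stack = []  # spans that are currently open

    @contextmanager
    def agent_span(self, name, **attributes):
        parent_id = self._stack[-1].span_id if self._stack else None
        span = Span(self.trace_id, uuid.uuid4().hex[:16], parent_id, name, dict(attributes))
        self.spans.append(span)
        self._stack.append(span)
        try:
            yield span
        finally:
            self._stack.pop()


tracer = Tracer()
with tracer.agent_span("orchestrator", input="Refund order 1234") as root:
    # Routing step: record where the request was handed off.
    with tracer.agent_span("router", input="Refund order 1234") as router:
        router.attributes["output"] = "route to billing"
        router.attributes["handoff_target"] = "billing_agent"
    # Specialist step: record which tool was used and what came back.
    with tracer.agent_span("billing_agent", input=router.attributes["output"]) as billing:
        billing.attributes["tool_used"] = "refund_api"
        billing.attributes["output"] = "refund issued"
    root.attributes["output"] = billing.attributes["output"]

for s in tracer.spans:
    print(s.name, s.parent_id, s.attributes)
```

Because every step carries the same trace ID and a parent ID, a failed run can be walked step by step instead of being judged only on the final output.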
Journey Context:
Developers often only evaluate the final output of a multi-agent system. When it fails, they don't know if Agent A passed bad context to Agent B, or if Agent B misused a tool. By treating each agent invocation as a span and handoffs as linked events, you can run evals at the span level (e.g., Did the routing agent choose the right specialist?) rather than just the trace level. This is critical because fixing a routing error is vastly different from fixing a tool-execution error.
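A span-level eval can be sketched as one check per agent step plus a check on each handoff. The example below is a simplified illustration under stated assumptions: the span dictionaries, the agent names, the expected values, and the helper functions are all hypothetical, and a real setup would read spans from your tracing backend.

```python
# Hypothetical spans from one failed trace, in execution order.
spans = [
    {"name": "router", "input": "Refund order 1234",
     "output": "route to shipping", "handoff_target": "shipping_agent"},
    {"name": "shipping_agent", "input": "route to shipping",
     "tool_used": "track_package", "output": "no package found"},
]


def check_router(span):
    # Did the routing agent choose the right specialist? (expected value is assumed)
    return span.get("handoff_target") == "billing_agent"


def check_specialist(span):
    # Did the specialist call the expected tool? (expected value is assumed)
    return span.get("tool_used") == "refund_api"


EVALUATORS = {
    "router": check_router,
    "billing_agent": check_specialist,
    "shipping_agent": check_specialist,
}


def handoff_preserves_context(prev_span, next_span):
    # Context-loss check: the receiving agent should see what the sender produced.
    return prev_span["output"] == next_span["input"]


def first_failure(spans):
    """Return the name of the earliest span that fails its check, or None."""
    for span in spans:
        check = EVALUATORS.get(span["name"])
        if check is not None and not check(span):
            return span["name"]
    return None


print(first_failure(spans))
print(handoff_preserves_context(spans[0], spans[1]))
```

In this sample the handoff passed context intact, yet the earliest failing span is the router, so the fix belongs in routing rather than in the specialist's tool call.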
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:42:53.782303+00:00— report_created — created