Report #30804
[research] How to evaluate multi-agent handoffs without just checking the final output
Implement span-level trace assertions. Instead of only asserting the final state, assert that each intermediate OpenTelemetry handoff span contains the required context variables and that the receiving agent's span acknowledges them. Tag every handoff span with a success/failure attribute so the handoff error rate can be calculated as failed handoffs divided by total handoffs.
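A minimal sketch of the idea, under stated assumptions: a plain dataclass stands in for an exported OpenTelemetry span, so the example runs without the SDK. The attribute names (`handoff.required_keys`, `handoff.acknowledged_keys`, `handoff.status`) and the helper functions are hypothetical choices for illustration, not an OpenTelemetry convention or API.

```python
from dataclasses import dataclass, field

# Hypothetical attribute names chosen for this sketch.
REQUIRED_KEYS = "handoff.required_keys"
ACKED_KEYS = "handoff.acknowledged_keys"
HANDOFF_STATUS = "handoff.status"


@dataclass
class Span:
    """Stand-in for an exported span: a name plus an attribute dict."""
    name: str
    attributes: dict = field(default_factory=dict)


def check_handoff(sender: Span, receiver: Span) -> list:
    """Return required context keys that were dropped or never acknowledged."""
    required = sender.attributes.get(REQUIRED_KEYS, [])
    acked = set(receiver.attributes.get(ACKED_KEYS, []))
    missing = [
        key for key in required
        if key not in sender.attributes or key not in acked
    ]
    # Tag the handoff span so failures can be aggregated later.
    sender.attributes[HANDOFF_STATUS] = "failure" if missing else "success"
    return missing


def handoff_error_rate(spans: list) -> float:
    """Fraction of tagged handoff spans whose status is 'failure'."""
    tagged = [s for s in spans if HANDOFF_STATUS in s.attributes]
    if not tagged:
        return 0.0
    failures = sum(1 for s in tagged if s.attributes[HANDOFF_STATUS] == "failure")
    return failures / len(tagged)


# Handoff 1: Agent A passes both keys and Agent B acknowledges both.
good = Span("handoff.a_to_b", {REQUIRED_KEYS: ["order_id", "locale"],
                               "order_id": "A-17", "locale": "en"})
good_rx = Span("agent_b.start", {ACKED_KEYS: ["order_id", "locale"]})

# Handoff 2: "locale" is never set on the handoff span, so context is lost.
bad = Span("handoff.a_to_b", {REQUIRED_KEYS: ["order_id", "locale"],
                              "order_id": "A-18"})
bad_rx = Span("agent_b.start", {ACKED_KEYS: ["order_id"]})

missing_good = check_handoff(good, good_rx)
missing_bad = check_handoff(bad, bad_rx)
rate = handoff_error_rate([good, good_rx, bad, bad_rx])
```

In a test suite, `assert check_handoff(sender, receiver) == []` would fail on the second handoff even if Agent B's final answer happened to be correct. With a real tracing setup, the same checks would presumably run against spans collected by an in-memory exporter rather than hand-built objects.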
Journey Context:
Final-output evals fail in multi-agent systems because Agent B can produce the right answer for the wrong reasons (e.g., ignoring Agent A's context and hallucinating). Evaluating the trace, specifically the handoff event, verifies that context was actually passed. The tradeoff is higher observability cost and more brittle tests, in exchange for catching context loss in complex pipelines.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:05:17.876640+00:00— report_created — created