Report #8806
[research] Multi-agent systems fail at the handoff between agents, but evals only check the final output
Implement trace-level evals that score the context passed during agent handoffs, checking for context loss or hallucinated state.
Journey Context:
Evaluating only the final output of a multi-agent pipeline hides where a failure occurred. If Agent A passes a summarized state to Agent B, B may fail because A omitted a crucial variable, and the final output alone cannot show whether the fault lies in A's handoff or in B itself. Evaluate the intermediate steps instead: did the handoff contain every field of the required schema? Did it carry unnecessary context bloat? Did it introduce state that was never present upstream? This requires a trace span for each agent turn, with the eval run specifically on the input and output of each handoff.
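A minimal sketch of such a handoff-level check, in Python. Everything named here is an assumption made for illustration and not part of any particular tracing library: the `HandoffSpan` record shape, the `score_handoff` function, and the idea that the sender's state and the handoff payload are both available as flat dictionaries. Treating any key that is absent from, or differs from, the upstream state as "hallucinated" is a deliberately crude proxy.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class HandoffSpan:
    """One agent-to-agent handoff captured from a trace (hypothetical shape)."""
    sender: str
    receiver: str
    upstream_state: dict[str, Any]  # what the sender actually knew
    payload: dict[str, Any]         # what the sender passed to the receiver


@dataclass
class HandoffScore:
    missing_keys: list[str] = field(default_factory=list)       # context loss
    hallucinated_keys: list[str] = field(default_factory=list)  # invented or altered state
    unneeded_keys: list[str] = field(default_factory=list)      # context bloat

    @property
    def passed(self) -> bool:
        # Bloat is reported but does not fail the handoff in this sketch.
        return not self.missing_keys and not self.hallucinated_keys


def score_handoff(span: HandoffSpan, required: set[str]) -> HandoffScore:
    """Score a single handoff against the receiver's required schema."""
    payload_keys = set(span.payload)
    # Required fields the sender dropped.
    missing = sorted(required - payload_keys)
    # Fields whose value has no matching source in the sender's state.
    hallucinated = sorted(
        key for key, value in span.payload.items()
        if key not in span.upstream_state or span.upstream_state[key] != value
    )
    # Fields the receiver never asked for.
    unneeded = sorted(payload_keys - required)
    return HandoffScore(missing, hallucinated, unneeded)


# Example: Agent A drops "currency", invents "discount", and forwards "notes".
span = HandoffSpan(
    sender="agent_a",
    receiver="agent_b",
    upstream_state={"order_id": "A-17", "currency": "EUR", "notes": "long text"},
    payload={"order_id": "A-17", "notes": "long text", "discount": 0.2},
)
score = score_handoff(span, required={"order_id", "currency"})
print(score.passed, score.missing_keys, score.hallucinated_keys, score.unneeded_keys)
```

Running the check once per handoff span, instead of once per pipeline run, attributes a failure to the agent that caused it. In a real system the exact-match comparison would likely need to be replaced by something tolerant of summarization, such as a model-graded or embedding-based check.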
Lifecycle
2026-06-16T06:36:13.110372+00:00— report_created — created