Report #5310
[research] Multi-agent system produces wrong final answer but eval only checks final output, hiding which sub-agent failed
Implement trace-level evals on agent handoffs. Assert that the context passed between Agent A and Agent B contains the required schema keys and that no critical state (like user intent) was dropped during the transfer.
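A minimal sketch of such a handoff assertion, assuming the handoff payload is a plain dict. The key names (`user_intent`, `task_id`, `constraints`) and the `check_handoff` helper are hypothetical and only illustrate the idea; substitute the schema your agents actually exchange.

```python
from typing import Any, Mapping

# Hypothetical schema: keys every A -> B handoff payload must carry.
REQUIRED_KEYS = {"user_intent", "task_id", "constraints"}
# Hypothetical subset whose values must also be non-empty.
CRITICAL_KEYS = {"user_intent"}


def check_handoff(payload: Mapping[str, Any],
                  required: set[str] = REQUIRED_KEYS,
                  critical: set[str] = CRITICAL_KEYS) -> list[str]:
    """Return a list of violations; an empty list means the handoff passed."""
    present = set(payload)
    # Schema check: every required key must be present.
    problems = [f"missing key: {key}" for key in sorted(required - present)]
    # State check: critical keys must not be carried over as empty values.
    for key in sorted(critical & present):
        if payload[key] in (None, "", [], {}):
            problems.append(f"empty critical value: {key}")
    return problems


handoff = {"task_id": "t1", "constraints": []}  # user_intent was dropped
violations = check_handoff(handoff)
```

In an eval suite, `assert not violations, violations` on each recorded handoff would fail at the transfer that lost the state, rather than at the final answer.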
Journey Context:
End-to-end evals on multi-agent systems hide the root cause of a failure: a wrong answer from Agent C might be caused by Agent A passing bad context, or by Agent B dropping it. You cannot fix what you cannot attribute. By evaluating the intermediate handoff payloads (the exact messages/tools passed between agents), you isolate the point of failure. This requires tracing spans that capture the exact state at each transition boundary.
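One way to capture state at the transition boundary is sketched below, using only the standard library. `HandoffSpan`, `HandoffTrace`, the agent names, and the payload keys are all hypothetical; a real system would more likely attach the payload to spans in whatever tracing backend it already uses.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class HandoffSpan:
    """One recorded transition: who handed what to whom."""
    source: str
    target: str
    payload: dict[str, Any]


@dataclass
class HandoffTrace:
    spans: list[HandoffSpan] = field(default_factory=list)

    def record(self, source: str, target: str, payload: dict[str, Any]) -> None:
        # Deep-copy so later mutation by the receiving agent cannot alter the record.
        self.spans.append(HandoffSpan(source, target, copy.deepcopy(payload)))

    def first_failure(self, check: Callable[[dict[str, Any]], bool]) -> Optional[HandoffSpan]:
        """Return the earliest span whose payload fails `check`, or None."""
        for span in self.spans:
            if not check(span.payload):
                return span
        return None


trace = HandoffTrace()
trace.record("agent_a", "agent_b", {"user_intent": "cancel subscription", "account_id": 42})
trace.record("agent_b", "agent_c", {"account_id": 42})  # intent dropped here

# Attribute the failure to the first handoff that lost the user intent.
culprit = trace.first_failure(lambda p: bool(p.get("user_intent")))
if culprit is not None:
    print(f"handoff {culprit.source} -> {culprit.target} lost user_intent")
```

Because the trace is ordered, the first failing span points at the agent that produced the bad payload (here `agent_b`), even though the wrong answer only becomes visible downstream.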
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T21:03:54.902842+00:00— report_created — created