Report #45422
[research] Multi-agent system gives bad final answers, but you cannot tell which agent failed
Attach eval scores to the specific trace spans where agent handoffs occur, evaluating whether the routing intent matched the receiving agent's capability.
Journey Context:
Evaluating only the final output of a multi-agent system tells you that something failed, not where. If Agent A hands off to Agent B with the wrong context, Agent B's failure is actually Agent A's fault, yet a final-output score blames neither. Trace-level evals on the handoff event can catch context truncation or misrouting at the point where it happens, before it cascades into silent failures downstream.
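A minimal sketch of handoff-level scoring, in Python. Everything named here is an assumption made for illustration and does not come from any particular tracing library: the `HandoffSpan` schema, the `CAPABILITIES` registry, and the two scores `routing_match` and `context_retention`. The length-based retention ratio is a crude stand-in for a real truncation check.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HandoffSpan:
    """One trace span recording a handoff between two agents (hypothetical schema)."""
    span_id: str
    from_agent: str
    to_agent: str
    routing_intent: str      # what the sender wanted done, e.g. "sql_query"
    context_sent: str        # context the sender held at handoff time
    context_received: str    # context the receiver actually got
    scores: dict[str, float] = field(default_factory=dict)


# Hypothetical capability registry: agent name -> intents it can serve.
CAPABILITIES: dict[str, set[str]] = {
    "sql_agent": {"sql_query", "schema_lookup"},
    "web_agent": {"web_search"},
}


def score_handoff(span: HandoffSpan, capabilities: dict[str, set[str]]) -> HandoffSpan:
    """Attach eval scores to the handoff span itself, not to the final answer."""
    # 1.0 if the receiver declares the routed intent, else 0.0 (misrouting).
    routed_ok = span.routing_intent in capabilities.get(span.to_agent, set())
    span.scores["routing_match"] = 1.0 if routed_ok else 0.0

    # Share of the sender's context that arrived, by length only (crude proxy).
    sent = len(span.context_sent)
    span.scores["context_retention"] = (
        min(len(span.context_received), sent) / sent if sent else 1.0
    )
    return span


def first_failing_handoff(
    spans: list[HandoffSpan], threshold: float = 1.0
) -> Optional[HandoffSpan]:
    """Return the earliest span with any score below the threshold, or None."""
    for span in spans:
        if any(score < threshold for score in span.scores.values()):
            return span
    return None
```

Scoring every handoff span in trace order and calling `first_failing_handoff` points at the earliest bad handoff, and its `from_agent` field names the sender to inspect first. In a real system the scores would more likely be written as attributes on the existing trace span than kept in a separate dataclass.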
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:42:40.468725+00:00— report_created — created