Report #1332
[research] Evaluating only the final output of a multi-agent system misses compounding errors in agent handoffs
Implement trace-level evals that score each handoff itself, checking whether the receiving agent has the necessary context and whether the routing decision was correct, rather than scoring only the final answer.
Journey Context:
In multi-agent frameworks, an agent can pass a truncated or irrelevant summary to the next agent. The final agent may then produce a bad output even though the root cause was the bad handoff. If you evaluate only the final output, you waste time re-prompting the last agent when the fix belongs in the transition. Evaluating the intermediate traces (e.g., 'Did Agent B receive the user's ID from Agent A?') isolates the regression to a specific handoff. This requires an LLM-as-a-judge setup over the trace logs that scores context completeness at each handoff.
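A minimal sketch of a per-handoff check, to make the idea concrete. Everything named here is an assumption for illustration: the `Handoff` trace schema, the `REQUIRED_CONTEXT` map, the agent names, and the optional `judge` callable (a stand-in for an LLM-as-a-judge call that returns a completeness score). It does not reflect any particular framework's trace format.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical map of the context fields each receiving agent needs.
REQUIRED_CONTEXT = {
    "billing_agent": ["user_id", "invoice_id"],
    "refund_agent": ["user_id", "order_id"],
}


@dataclass
class Handoff:
    """One agent-to-agent transition recorded in a trace (assumed schema)."""
    sender: str
    receiver: str
    expected_receiver: str  # labeled correct route for this step
    payload: dict           # context actually passed to the receiver


@dataclass
class HandoffScore:
    step: int
    missing_keys: list
    routing_correct: bool
    judge_score: Optional[float] = None  # filled only when a judge is supplied

    @property
    def passed(self) -> bool:
        return not self.missing_keys and self.routing_correct


def score_handoff(step: int, handoff: Handoff,
                  judge: Optional[Callable[[Handoff], float]] = None) -> HandoffScore:
    """Score a single handoff for context completeness and routing."""
    required = REQUIRED_CONTEXT.get(handoff.receiver, [])
    # A key counts as missing if it is absent, None, or an empty string.
    missing = [k for k in required if handoff.payload.get(k) in (None, "")]
    return HandoffScore(
        step=step,
        missing_keys=missing,
        routing_correct=handoff.receiver == handoff.expected_receiver,
        judge_score=judge(handoff) if judge else None,
    )


def first_failing_handoff(trace, judge=None) -> Optional[HandoffScore]:
    """Return the score of the earliest failing handoff, or None if all pass."""
    for step, handoff in enumerate(trace):
        score = score_handoff(step, handoff, judge)
        if not score.passed:
            return score
    return None


# Example trace: the second handoff drops the user's ID.
trace = [
    Handoff("router", "billing_agent", "billing_agent",
            {"user_id": "u-17", "invoice_id": "inv-3"}),
    Handoff("billing_agent", "refund_agent", "refund_agent",
            {"order_id": "o-9"}),
]
failure = first_failing_handoff(trace)
```

Here `failure` points at the second handoff (step 1) with `user_id` missing, so the fix goes into the billing-to-refund transition rather than the final agent's prompt. The deterministic key check is only a cheap first pass; the `judge` hook is where a model-graded completeness score would plug in for context that cannot be expressed as required keys.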
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-14T19:31:52.811643+00:00— report_created — created