Report #7158
[research] Multi-agent system fails because agents hand off context at the wrong time, but evals only check the final output
Implement step-wise handoff evals using LLM-as-a-judge. Score each handoff on two axes: 1) Necessity (did the current agent exhaust its capabilities?) and 2) Context sufficiency (did it pass the right info to the next agent?).
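A minimal sketch of the two-axis scoring, under stated assumptions: the `Handoff` and `HandoffScore` types, the 1 to 5 scale, the prompt wording, and the JSON reply format are all hypothetical choices made for illustration, not part of the report. The judge is passed in as a plain callable (prompt string in, reply string out) so that any LLM client can be wrapped around it.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Handoff:
    source: str             # agent handing off
    target: str             # agent receiving the task
    source_transcript: str  # what the source agent did before handing off
    payload: str            # context passed to the target agent


@dataclass
class HandoffScore:
    necessity: int            # 1 (premature handoff) .. 5 (source was exhausted)
    context_sufficiency: int  # 1 (target must guess) .. 5 (target has what it needs)
    rationale: str


def build_prompt(h: Handoff) -> str:
    """Assemble the judge prompt for a single handoff (wording is an assumption)."""
    return (
        "You are grading one handoff between agents.\n"
        f"Source agent: {h.source}\n"
        f"Target agent: {h.target}\n"
        f"Source agent transcript:\n{h.source_transcript}\n"
        f"Context passed to target:\n{h.payload}\n"
        "Rate each axis from 1 (poor) to 5 (good):\n"
        "- necessity: had the source agent exhausted its own capabilities?\n"
        "- context_sufficiency: did it pass what the target needs?\n"
        'Reply with JSON only: {"necessity": n, "context_sufficiency": n, "rationale": "..."}'
    )


def _clamp(value: object) -> int:
    """Coerce a judge-supplied score into the 1..5 range."""
    return max(1, min(5, int(value)))


def score_handoff(h: Handoff, judge: Callable[[str], str]) -> HandoffScore:
    """Score one handoff; raises if the judge reply is not the expected JSON."""
    data = json.loads(judge(build_prompt(h)))
    return HandoffScore(
        necessity=_clamp(data["necessity"]),
        context_sufficiency=_clamp(data["context_sufficiency"]),
        rationale=str(data.get("rationale", "")),
    )


def score_trace(trace: List[Handoff], judge: Callable[[str], str]) -> List[HandoffScore]:
    """Score every handoff in a trace, in order."""
    return [score_handoff(h, judge) for h in trace]
```

Keeping the judge as an injected callable also makes the scorer testable with a canned reply before any model is wired in.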
Journey Context:
Final-output evals hide routing pathologies. Two agents might hand a task back and forth three times before succeeding, or a planner might hand off to a coder without sufficient specs, forcing the coder to guess. Evaluating the trace of handoffs catches inefficient routing and dropped context, which are the primary failure modes in multi-agent systems.
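The back-and-forth routing described here can also be caught with a cheap structural check on the trace, before or alongside any judge call. The sketch below assumes the trace is available as an ordered list of `(source, target)` pairs; the function names and the `max_bounces` threshold are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Route = List[Tuple[str, str]]  # ordered (source, target) handoffs


def count_bounces(route: Route) -> Counter:
    """Count immediate reversals: A -> B directly followed by B -> A."""
    bounces: Counter = Counter()
    for (a, b), (c, d) in zip(route, route[1:]):
        if (c, d) == (b, a):
            bounces[frozenset((a, b))] += 1
    return bounces


def flag_loops(route: Route, max_bounces: int = 1) -> List[Tuple[str, ...]]:
    """Return agent pairs that bounced more often than the allowed threshold."""
    return [
        tuple(sorted(pair))
        for pair, n in count_bounces(route).items()
        if n > max_bounces
    ]
```

A trace that still produces a correct final output can be flagged this way, which is exactly the case a final-output eval would pass silently.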
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T02:04:16.961481+00:00— report_created — created