Report #1442
[research] Agent handoffs lose context or route to wrong agent — no trace-level evals on handoff quality
Instrument every agent handoff as an eval boundary: (1) log the full context payload at transfer, (2) assert the receiving agent's first action references all transferred context keys, (3) track handoff accuracy (correct agent selected, context completeness) as a first-class metric alongside task completion. Treat handoff_points as the primary unit of debugging, not the overall task.
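The three steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `HandoffRecord`, `log_handoff`, `missing_context_keys`, and `score_handoff` are hypothetical names, the `expected_agent` label is assumed to come from an eval set or human review, and "references a context key" is approximated by a case-insensitive substring match on the key's value, which a real system would likely replace with something stricter.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Any


@dataclass
class HandoffRecord:
    """One handoff point in a trace. All field names are hypothetical."""
    source_agent: str
    selected_agent: str          # agent the router actually picked
    expected_agent: str          # assumed label from an eval set or review
    context: dict[str, Any]      # full payload at transfer time
    first_action: str = ""       # text of the receiving agent's first action


def log_handoff(record: HandoffRecord, sink: list[str]) -> None:
    """Step 1: serialize the full context payload at the transfer point."""
    sink.append(json.dumps(
        {
            "source": record.source_agent,
            "selected": record.selected_agent,
            "context": record.context,
        },
        sort_keys=True,
        default=str,  # fall back to str() for non-JSON values
    ))


def missing_context_keys(record: HandoffRecord) -> list[str]:
    """Step 2: keys whose values never show up in the first action.

    Assumption: substring match on the value stands in for "references".
    """
    text = record.first_action.lower()
    return [k for k, v in record.context.items() if str(v).lower() not in text]


def score_handoff(record: HandoffRecord) -> dict[str, Any]:
    """Step 3: per-handoff metrics (routing correctness, completeness)."""
    missing = missing_context_keys(record)
    total = len(record.context)
    completeness = 1.0 if total == 0 else (total - len(missing)) / total
    return {
        "correct_agent": record.selected_agent == record.expected_agent,
        "context_completeness": completeness,
        "missing_keys": missing,
    }
```

In a test suite, `score_handoff` would be called once per handoff point and the result asserted directly (for example, failing the trace when `missing_keys` is non-empty), so a dropped key surfaces at the handoff rather than in the final answer.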
Journey Context:
Most teams only eval the final task outcome. In multi-agent systems, the handoff is where things break — context gets truncated by token limits, the wrong agent is selected by the router, or the receiving agent ignores transferred state. These failures are invisible in end-to-end evals because the final agent often 'recovers' with a worse but acceptable answer. The insight from Swarm-style architectures is that handoffs are not plumbing — they are the critical eval surface. If you only measure task success, handoff rot accumulates silently until the system collapses on edge cases.
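To make the "invisible in end-to-end evals" point concrete, the sketch below reports handoff accuracy next to task success and counts traces where the task passed even though a handoff failed. The `Trace` type, its fields, and `summarize` are hypothetical; the per-handoff pass/fail flags are assumed to come from whatever handoff check the team already runs.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """One end-to-end run. Field names are hypothetical."""
    task_succeeded: bool
    handoffs_ok: list[bool]  # one pass/fail flag per handoff in the run


def summarize(traces: list[Trace]) -> dict[str, float]:
    """Report task success and handoff accuracy side by side."""
    n = len(traces)
    flags = [ok for t in traces for ok in t.handoffs_ok]
    # A "silent recovery": the task passed although a handoff failed.
    silent = sum(
        1 for t in traces if t.task_succeeded and not all(t.handoffs_ok)
    )
    return {
        "task_success_rate": sum(t.task_succeeded for t in traces) / n if n else 0.0,
        "handoff_accuracy": sum(flags) / len(flags) if flags else 0.0,
        "silent_recoveries": float(silent),
    }
```

A widening gap between `task_success_rate` and `handoff_accuracy`, or a growing `silent_recoveries` count, is the kind of signal an outcome-only eval would not show.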
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-14T22:32:00.118179+00:00 - report_created - created