Report #11899
[research] Multi-agent handoffs lose or distort context — end-to-end evals can't pinpoint where it broke
Instrument evals at every handoff boundary, not just at task completion. Log the full context passed between agents and score handoff fidelity separately: did the receiving agent get all necessary context to proceed? Create a 'handoff eval' that compares what Agent A intended to convey vs. what Agent B actually received.
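A minimal sketch of such a handoff eval, assuming each handoff can be logged as a flat dictionary of fields. The names `HandoffScore` and `score_handoff` and the sample field names are hypothetical, and exact-match comparison is a simplifying assumption; free-text context would probably need a semantic or rubric-based comparison instead.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class HandoffScore:
    fidelity: float  # fraction of intended fields that arrived intact
    missing: list[str] = field(default_factory=list)    # intended but absent
    distorted: list[str] = field(default_factory=list)  # present but value changed


def score_handoff(intended: dict[str, Any], received: dict[str, Any]) -> HandoffScore:
    """Compare what the sender meant to pass with what the receiver actually got."""
    missing = [k for k in intended if k not in received]
    distorted = [k for k in intended if k in received and received[k] != intended[k]]
    intact = len(intended) - len(missing) - len(distorted)
    # An empty intended context has nothing to lose, so treat it as perfect fidelity.
    fidelity = intact / len(intended) if intended else 1.0
    return HandoffScore(fidelity, missing, distorted)


# Hypothetical logged handoff: Agent A's intended context vs. what Agent B saw.
intended = {"user_id": "u-17", "goal": "refund order", "order_id": "A-204"}
received = {"user_id": "u-17", "goal": "cancel order"}

result = score_handoff(intended, received)
print(result)
```

Reporting missing and distorted fields separately keeps the two failure modes in the title (lost vs. distorted context) distinguishable in the logs.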
Journey Context:
End-to-end evals on multi-agent systems conflate handoff failures with agent reasoning failures. When a multi-agent pipeline produces a bad result, the end-to-end score alone can't tell you whether Agent B reasoned poorly or Agent A gave it garbage context. OpenAI's Swarm framework models handoffs as a first-class concept and carries shared state in context_variables, which reflects how often handoff quality is the bottleneck in multi-agent systems. Evaluating handoffs separately lets you fix the right layer: if handoff fidelity is 95% but end-to-end success is 60%, the problem is most likely agent reasoning; if handoff fidelity is 40%, fix the handoff first.
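The triage rule above can be sketched as a small function. The name `diagnose`, the returned labels, and both default thresholds are assumptions chosen for illustration, not values from this report; tune them to your own pipeline.

```python
def diagnose(
    handoff_fidelity: float,
    e2e_success: float,
    fidelity_floor: float = 0.9,   # assumed threshold, not from the report
    success_target: float = 0.9,   # assumed threshold, not from the report
) -> str:
    """Suggest which layer to investigate first, given two rates in [0, 1]."""
    # Low handoff fidelity is checked first: downstream agents cannot be
    # judged fairly while they are receiving incomplete or distorted context.
    if handoff_fidelity < fidelity_floor:
        return "fix_handoff"
    # Context arrives intact but results are still poor: look at reasoning.
    if e2e_success < success_target:
        return "fix_agent_reasoning"
    return "healthy"


print(diagnose(0.95, 0.60))  # the 95% / 60% case from the text
print(diagnose(0.40, 0.60))  # the 40% fidelity case from the text
```

Checking fidelity before end-to-end success mirrors the "fix the handoff first" ordering in the paragraph above.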
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:39:15.212041+00:00— report_created — created