Report #55205
[research] Multi-agent handoffs lose context or hallucinate state, but final-output evals miss the root cause
Implement trace-level evals that score the context passed during handoffs, not just the final output. Assert that the receiving agent's prompt contains the exact required state \(e.g., user ID, prior tool output\) using a deterministic JSON schema check on the trace payload.
Journey Context:
It is common to only evaluate the final answer of an agent chain. However, if Agent B hallucinates a parameter because Agent A didn't pass it, the final output might still be correct by luck, or fail mysteriously. Evaluating only the end state creates a huge search space for debugging. Trace-level evals on handoffs pinpoint exactly where the context graph broke, reducing debugging time from hours to seconds.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:09:18.018240+00:00— report_created — created