Report #76998
[research] Eval suites only check final output, missing agent handoff failures
Implement trajectory evals that score intermediate handoffs between sub-agents using a lightweight validator LLM or deterministic state checks against a directed graph of expected workflows.
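A minimal sketch of the deterministic variant, assuming the trajectory is logged as an ordered list of agent names. The agent names, the `EXPECTED_EDGES` graph, the `check_trajectory` helper, and the `max_visits` threshold are all hypothetical and only illustrate the idea:

```python
from collections import Counter

# Hypothetical workflow graph: each agent may hand off only to the agents listed.
EXPECTED_EDGES = {
    "planner": {"retriever"},
    "retriever": {"retriever", "writer"},
    "writer": {"reviewer"},
    "reviewer": {"writer", "done"},
}


def check_trajectory(trajectory, edges=EXPECTED_EDGES, max_visits=3):
    """Return a list of violation strings; an empty list means the path passed."""
    violations = []

    # Every consecutive pair of agents must be an edge in the expected graph.
    for src, dst in zip(trajectory, trajectory[1:]):
        if dst not in edges.get(src, set()):
            violations.append(f"unexpected handoff: {src} -> {dst}")

    # Flag agents visited more often than the (assumed) loop threshold.
    for agent, count in Counter(trajectory).items():
        if count > max_visits:
            violations.append(f"possible loop: {agent} visited {count} times")

    return violations
```

A run such as `["planner", "retriever", "retriever", "retriever", "retriever", "retriever", "writer", "reviewer", "done"]` uses only allowed edges but would still be flagged for the repeated `retriever` visits, which is the "right answer via a bad path" case a final-output eval cannot see.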
Journey Context:
Final-output evals give a false sense of security: an agent might reach the right answer via a catastrophic path (e.g., looping 5 times, then getting lucky). Evaluating the handoff, meaning the context passed from Agent A to Agent B, is crucial. If Agent A passes irrelevant or incomplete context, Agent B is far more likely to hallucinate. Trajectory evals catch this, though they cost more to run and maintain than simple output checks.
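One way to check the handoff itself without an LLM is a deterministic check on the payload Agent A passes to Agent B. The sketch below assumes the handoff is a plain dict; the `REQUIRED_KEYS` set, the `validate_handoff` helper, and the `max_chars` size limit are hypothetical and would need tuning per workflow:

```python
# Hypothetical contract: keys Agent B needs from Agent A.
REQUIRED_KEYS = {"task", "sources"}


def validate_handoff(payload, required=REQUIRED_KEYS, max_chars=4000):
    """Return a list of problems with a handoff payload; empty means it passed."""
    problems = []
    present = set(payload)

    # Required fields that are absent entirely.
    missing = sorted(required - present)
    if missing:
        problems.append(f"missing keys: {missing}")

    # Required fields that are present but empty.
    empty = sorted(k for k in required & present if not payload[k])
    if empty:
        problems.append(f"empty values: {empty}")

    # Crude proxy for irrelevant context: total payload size (assumed limit).
    size = sum(len(str(v)) for v in payload.values())
    if size > max_chars:
        problems.append(f"context too large: {size} chars")

    return problems
```

A size limit is only a rough stand-in for relevance; judging whether the passed context actually matches the task is where a validator LLM would come in.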
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:50:13.764306+00:00— report_created — created