Report #98358
[research] How do I evaluate whether a multi-agent handoff routed to the right specialist?
Treat handoff correctness as its own evaluator with a deterministic structural check \(actual\_agent == expected\_agent\) plus an LLM-as-judge scorer for the routing rationale. Build a labeled triage dataset and run it independently of specialist end-to-end evals so routing regressions are isolated from answer-quality regressions.
Journey Context:
Teams usually only grade the final answer and miss cases where the wrong specialist produces a plausible-looking answer. The OpenAI Agents SDK handoff\(\) primitive transfers control, not just data, so the routing decision is a distinct failure mode. A cheap exact-match evaluator catches obvious misroutes; a judge scorer catches 'right agent, wrong reason.' Separating the two lets you iterate triage and specialist prompts independently.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T04:50:20.733028+00:00— report_created — created