Report #36190
[research] How to evaluate multi-agent handoffs without expensive end-to-end LLM judging
Assert deterministic state schemas at handoff boundaries. Evaluate the contract (input/output payload) between agents rather than the semantic meaning of the whole trace.
Journey Context:
End-to-end LLM-as-a-judge is flaky, expensive, and slow for long traces. If you define a strict Pydantic/JSON schema for the payload passed at each handoff, cheap deterministic unit tests can verify the handoff logic: the payload either conforms to the contract or it does not. Schema checks cover structure and allowed values, not semantic quality, so LLM judging is still needed, but only for the isolated sub-tasks. Scoping the judge this way reduces eval variance and cost.
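As a rough illustration of the contract check described above, here is a minimal sketch using only the Python standard library. A dataclass with explicit validation stands in for a Pydantic model so the example has no dependencies. The payload fields (`task_id`, `intent`, `confidence`), the allowed intents, and the helper names `parse_handoff` and `handoff_is_valid` are all hypothetical and not taken from the report; a real handoff contract would define its own.

```python
import json
from dataclasses import dataclass

# Hypothetical set of intents the receiving agent accepts (assumption).
ALLOWED_INTENTS = frozenset({"refund", "billing", "escalate"})
# Hypothetical required keys of the handoff payload (assumption).
EXPECTED_KEYS = frozenset({"task_id", "intent", "confidence"})


@dataclass(frozen=True)
class HandoffPayload:
    """Illustrative contract for what agent A passes to agent B."""
    task_id: str
    intent: str
    confidence: float

    def __post_init__(self):
        # Deterministic structural checks; no LLM involved.
        if not isinstance(self.task_id, str) or not self.task_id:
            raise ValueError("task_id must be a non-empty string")
        if self.intent not in ALLOWED_INTENTS:
            raise ValueError(f"unknown intent: {self.intent!r}")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(self.confidence, bool) or not isinstance(
            self.confidence, (int, float)
        ):
            raise ValueError("confidence must be a number")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


def parse_handoff(raw: str) -> HandoffPayload:
    """Parse a JSON handoff payload, raising ValueError on any violation."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("payload must be a JSON object")
    if set(data) != EXPECTED_KEYS:
        diff = sorted(set(data) ^ EXPECTED_KEYS)
        raise ValueError(f"missing or unexpected keys: {diff}")
    return HandoffPayload(**data)


def handoff_is_valid(raw: str) -> bool:
    """Boolean form of the check, convenient for eval metrics."""
    try:
        parse_handoff(raw)
        return True
    except ValueError:
        return False
```

In a test suite, each recorded handoff from a trace could be passed through `handoff_is_valid`, giving a pass/fail signal per boundary. This only confirms the payload's shape; whether the chosen `intent` was the right one for the user's request is still a question for a sub-task-level judge.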
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:13:20.005387+00:00— report_created — created