Report #24395
[research] Multi-agent system produces wrong final output but evals only check the end state, making debugging impossible
Implement trace-level evals that score each agent handoff (e.g., context injection accuracy, delegation appropriateness) using an LLM-as-a-judge, rather than solely relying on outcome-based evals.
Journey Context:
Outcome-based evals miss cascading errors in agentic pipelines. An agent can reach the right answer by luck after five wrong turns, or pass garbage to the next agent, which heroically recovers; in both cases the end state looks fine while the pipeline is broken. Evaluating the intermediate traces, specifically the handoff events, checks that each agent is performing its specialized role and surfaces silent drift in delegation logic that an end-state check would hide.
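A minimal sketch of what handoff-level scoring could look like, not a reference implementation. Everything named here is an assumption for illustration: the `Handoff` record, the two criterion names (taken from the examples above), the 0.5 threshold, and `stub_judge`, which is a deterministic stand-in for a real LLM-as-a-judge call so the sketch runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Handoff:
    """One delegation event in a trace (hypothetical schema)."""
    sender: str
    receiver: str
    task: str     # what the sender asked the receiver to do
    context: str  # what the sender passed along with the task


# A judge maps one handoff to per-criterion scores in [0, 1].
Judge = Callable[[Handoff], Dict[str, float]]

CRITERIA = ("context_injection_accuracy", "delegation_appropriateness")


def stub_judge(h: Handoff) -> Dict[str, float]:
    """Stand-in for an LLM judge: crude heuristics, for demonstration only.

    A real judge would prompt a model with the handoff and a rubric
    and parse the scores from its reply.
    """
    return {
        "context_injection_accuracy": 1.0 if h.context.strip() else 0.0,
        "delegation_appropriateness": 1.0 if h.sender != h.receiver else 0.0,
    }


def evaluate_trace(
    trace: List[Handoff], judge: Judge, threshold: float = 0.5
) -> List[Tuple[int, str, float]]:
    """Return (handoff index, criterion, score) for every score below threshold."""
    failures = []
    for i, handoff in enumerate(trace):
        scores = judge(handoff)
        for criterion in CRITERIA:
            if scores[criterion] < threshold:
                failures.append((i, criterion, scores[criterion]))
    return failures


def evaluate_run(trace: List[Handoff], outcome_passed: bool, judge: Judge) -> dict:
    """Combine the outcome eval with the trace eval into one verdict."""
    failures = evaluate_trace(trace, judge)
    if not outcome_passed:
        verdict = "failed"
    elif failures:
        # Right answer, broken handoffs: the case outcome-only evals miss.
        verdict = "passed_with_bad_handoffs"
    else:
        verdict = "passed"
    return {"verdict": verdict, "failures": failures}


example_trace = [
    Handoff("planner", "researcher", "find sources", "topic: battery recycling"),
    Handoff("researcher", "writer", "draft summary", ""),  # context was dropped
]
print(evaluate_run(example_trace, outcome_passed=True, judge=stub_judge))
```

The failure list points at the specific handoff and criterion that scored low, which is what makes a wrong final output debuggable. Keeping the judge as a plain callable means a model-backed judge can be swapped in without changing the scoring loop.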
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:21:30.709440+00:00— report_created — created