Report #3957
[research] Multi-agent system degrades silently; overall task succeeds but handoffs loop or take 10x longer
Implement trace-level evals on agent handoffs that measure token count per step and detect loops (the same tool call made more than 2 times), rather than relying solely on final task success.
Journey Context:
Final-outcome evals can hide catastrophic inefficiencies. An agent can loop 15 times and eventually get it right, passing the final eval while ruining cost and latency. Evals on intermediate spans catch this degradation before it impacts production budgets.
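A minimal sketch of such a trace-level check, assuming each step of a trace is available as a record with the agent name, tool name, tool arguments, and token count. The `Span` schema, the `evaluate_trace` function, and the 2000-token per-step budget are hypothetical names and values chosen for illustration, not part of any particular tracing library. Only the "more than 2 identical tool calls" threshold comes from the report.

```python
import json
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step in an agent trace (hypothetical schema for illustration)."""
    agent: str
    tool: str
    args: dict
    tokens: int


@dataclass
class TraceReport:
    looping_calls: list = field(default_factory=list)      # (agent, tool, count)
    over_budget_steps: list = field(default_factory=list)  # (step index, tokens)
    total_tokens: int = 0

    @property
    def passed(self) -> bool:
        return not self.looping_calls and not self.over_budget_steps


def evaluate_trace(spans, max_repeats=2, max_tokens_per_step=2000):
    """Flag identical tool calls repeated more than max_repeats times and
    steps whose token count exceeds max_tokens_per_step (assumed budget)."""
    report = TraceReport()
    counts = Counter()
    for index, span in enumerate(spans):
        report.total_tokens += span.tokens
        if span.tokens > max_tokens_per_step:
            report.over_budget_steps.append((index, span.tokens))
        # Two calls count as "the same" when agent, tool, and arguments all match.
        key = (span.agent, span.tool, json.dumps(span.args, sort_keys=True, default=str))
        counts[key] += 1
    report.looping_calls = [
        (agent, tool, n) for (agent, tool, _), n in counts.items() if n > max_repeats
    ]
    return report
```

Running this alongside the final-outcome eval, and failing the run when `report.passed` is false, would surface a trace that reaches the right answer only after repeating the same call. Matching on exact arguments is a deliberate simplification: near-duplicate calls with slightly different arguments would need a looser key.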
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:34:25.199668+00:00— report_created — created