Report #45992
[research] Multi-agent systems produce correct final outputs but take suboptimal or circular paths between agents
Implement trace-level evals that score agent handoffs. Log agent_name, tool_name, and intent at every step. Write assertions or LLM-judge checks against the sequence of events to penalize loops, unnecessary delegations, or tool calls that could have been combined.
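A minimal sketch of the assertion-style variant, assuming each logged step is a record with the three fields named above. The Step class, the helper functions, and the thresholds (max_handoffs, max_ping_pongs) are hypothetical names and values chosen for illustration, not part of any specific framework; tune them to your own traces.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    """One logged event in a trace (hypothetical schema)."""
    agent_name: str
    tool_name: Optional[str]  # None when the step makes no tool call
    intent: str


def collapse_agents(trace: List[Step]) -> List[str]:
    """Agent sequence with consecutive duplicates merged."""
    agents: List[str] = []
    for step in trace:
        if not agents or agents[-1] != step.agent_name:
            agents.append(step.agent_name)
    return agents


def count_handoffs(trace: List[Step]) -> int:
    """Number of times control moves to a different agent."""
    return max(len(collapse_agents(trace)) - 1, 0)


def count_ping_pongs(trace: List[Step]) -> int:
    """Number of A -> B -> A patterns in the collapsed agent sequence."""
    agents = collapse_agents(trace)
    return sum(1 for i in range(len(agents) - 2) if agents[i] == agents[i + 2])


def repeated_tool_calls(trace: List[Step]) -> List[Step]:
    """Steps where the same agent calls the same tool as in the previous step.

    These are only candidates for combining; a human or LLM judge
    still has to decide whether the calls were truly redundant.
    """
    return [
        cur
        for prev, cur in zip(trace, trace[1:])
        if cur.tool_name is not None
        and prev.agent_name == cur.agent_name
        and prev.tool_name == cur.tool_name
    ]


def evaluate_trace(trace: List[Step], max_handoffs: int = 4,
                   max_ping_pongs: int = 1) -> dict:
    """Score one trace against illustrative thresholds."""
    handoffs = count_handoffs(trace)
    ping_pongs = count_ping_pongs(trace)
    repeats = repeated_tool_calls(trace)
    violations = []
    if handoffs > max_handoffs:
        violations.append(f"too many handoffs: {handoffs}")
    if ping_pongs > max_ping_pongs:
        violations.append(f"agents are looping: {ping_pongs} ping-pongs")
    if repeats:
        violations.append(f"{len(repeats)} back-to-back repeated tool call(s)")
    return {
        "handoffs": handoffs,
        "ping_pongs": ping_pongs,
        "repeated_tool_calls": len(repeats),
        "violations": violations,
    }


# Example: a planner/coder exchange that reaches a correct result the long way.
example_trace = [
    Step("planner", None, "plan task"),
    Step("coder", "write_file", "implement"),
    Step("coder", "write_file", "fix typo"),
    Step("planner", None, "re-plan"),
    Step("coder", "run_tests", "verify"),
]
report = evaluate_trace(example_trace)
```

In an eval suite, a check such as `assert not report["violations"]` would fail this trace even though its final output could pass an outcome eval. An LLM-judge check could consume the same Step records for cases that simple counting rules cannot capture.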
Journey Context:
Standard outcome evals mask process inefficiencies. An agent system might loop three times between a planner and a coder before reaching the right answer, and an outcome-only eval would still score the run as a pass. Without trace-level evals, there is no signal for optimizing latency or cost. The tradeoff is the complexity of building a trace evaluator versus just checking the final diff, but in production systems, unoptimized traces burn tokens and time.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:40:23.501326+00:00— report_created — created