Report #3717
[research] Evaluating multi-agent handoffs and trace-level failures instead of just final output
Implement trace-level evals using the OpenTelemetry semantic conventions for GenAI, evaluating each span (tool call, LLM response, handoff) independently rather than only the final output on the root span.
Journey Context:
Agents often fail silently in intermediate steps (e.g., passing the wrong context to a sub-agent). Final-outcome evals miss these failures, because a run can produce a plausible final answer while an intermediate step was wrong, and the result is impenetrable 'spaghetti agent' debugging. Evaluating each span or handoff separately isolates whether the planner, the tool, or the executor failed. OpenTelemetry's GenAI conventions provide a standard schema for this.
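A minimal sketch of per-span evaluation, under stated assumptions. It uses plain dataclasses as a stand-in for exported spans rather than the OpenTelemetry SDK, so it runs with the standard library alone. The attribute keys gen_ai.operation.name, gen_ai.agent.name, gen_ai.tool.name and error.type follow the GenAI semantic conventions as currently published (they are still evolving, so check the version you target). Everything under the app.* prefix, the three check functions, and the sample trace are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Span:
    """Simplified stand-in for an exported OpenTelemetry span."""
    span_id: str
    parent_id: Optional[str]
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Verdict:
    span_id: str
    name: str
    passed: bool
    reason: str


def check_llm_response(span: Span) -> Verdict:
    # Hypothetical rule: the model must have produced non-empty text.
    text = str(span.attributes.get("app.output_text", "")).strip()
    return Verdict(span.span_id, span.name, bool(text),
                   "ok" if text else "empty LLM output")


def check_tool_call(span: Span) -> Verdict:
    # error.type is recorded on spans that ended in an error.
    error = span.attributes.get("error.type")
    return Verdict(span.span_id, span.name, error is None,
                   "ok" if error is None else f"tool error: {error}")


def check_handoff(span: Span) -> Verdict:
    # Hypothetical rule: every context key the sub-agent requires
    # must actually have been forwarded by the caller.
    required = set(span.attributes.get("app.handoff.required_keys", []))
    forwarded = set(span.attributes.get("app.handoff.context_keys", []))
    missing = sorted(required - forwarded)
    return Verdict(span.span_id, span.name, not missing,
                   "ok" if not missing else f"missing context: {missing}")


# Route each span to an evaluator by its gen_ai.operation.name value.
EVALUATORS: Dict[str, Callable[[Span], Verdict]] = {
    "chat": check_llm_response,
    "execute_tool": check_tool_call,
    "invoke_agent": check_handoff,
}


def evaluate_trace(spans: List[Span]) -> List[Verdict]:
    """Evaluate every GenAI span on its own; skip spans with no evaluator."""
    verdicts = []
    for span in spans:
        evaluator = EVALUATORS.get(span.attributes.get("gen_ai.operation.name"))
        if evaluator is not None:
            verdicts.append(evaluator(span))
    return verdicts


# Hypothetical trace: the handoff drops order_id, yet no span reports an error.
TRACE = [
    Span("1", None, "handle_request"),
    Span("2", "1", "chat example-model", {
        "gen_ai.operation.name": "chat",
        "app.output_text": "Look up the order, then refund it.",
    }),
    Span("3", "1", "invoke_agent refund_agent", {
        "gen_ai.operation.name": "invoke_agent",
        "gen_ai.agent.name": "refund_agent",
        "app.handoff.required_keys": ["order_id", "customer_id"],
        "app.handoff.context_keys": ["customer_id"],
    }),
    Span("4", "3", "execute_tool issue_refund", {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": "issue_refund",
    }),
]

verdicts = evaluate_trace(TRACE)
failures = [v for v in verdicts if not v.passed]
for v in failures:
    print(f"span {v.span_id} ({v.name}): {v.reason}")
```

In this sample trace the LLM span and the tool span both pass, so a check on the final output alone would have little to flag; the per-span pass points at the handoff as the step that dropped context. In a real pipeline the Span objects would likely be built from whatever your trace backend exports, and the rule-based checks could be swapped for model-graded ones.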
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:06:03.266700+00:00— report_created — created