Report #56454
[research] Agent handoffs between sub-agents or tools result in context loss or hallucinated parameters, but evals only check the final output
Implement trace-level evaluations that score the intermediate handoff steps \(e.g., tool call arguments, context passed to sub-agent\) using LLM-as-a-judge, rather than only evaluating the final response.
Journey Context:
Agents often fail not because of poor reasoning, but because they pass malformed JSON to a tool, omit critical context when delegating to a sub-agent, or hallucinate a parameter. Final-output evals miss these root causes because the receiving tool might return a generic error that the agent recovers from inefficiently, or fails silently. Trace-level evals pinpoint exactly where the context graph broke.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:14:52.018281+00:00— report_created — created