Report #15050
[research] LLM-as-a-judge evals fail because the judge gets confused by verbose tool outputs
Before passing tool outputs to the judge LLM, summarize them or extract only the schema and mutations, so the judge's context stays focused on the decision boundary.
Journey Context:
When evaluating whether an agent chose the right tool, passing the full 10,000-line API response to the judge model degrades its reasoning (the lost-in-the-middle effect) and drastically increases cost. The judge only needs answers to questions like 'Did the API return a 200?' or 'Did the file content change from X to Y?'. Pre-processing the tool output before it reaches the judge is essential for reliable and cost-effective process evals.
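A minimal sketch of such a pre-processing step, using only the Python standard library. The function names (`summarize_tool_output`, `describe_schema`, `summarize_file_change`), the summary fields, and the size limits are hypothetical choices for illustration and do not come from any particular eval framework. The idea is to reduce an HTTP-style response to its status plus a type-only schema, and a file edit to its added and removed lines.

```python
import difflib
import json
from typing import Any

MAX_PREVIEW_CHARS = 200  # assumed budget for non-JSON bodies
MAX_DIFF_LINES = 20      # assumed cap on diff lines shown to the judge


def describe_schema(value: Any, depth: int = 0, max_depth: int = 3) -> Any:
    """Replace values with type names, keeping only the structure."""
    if isinstance(value, dict):
        if depth >= max_depth:
            return "object"
        return {k: describe_schema(v, depth + 1, max_depth) for k, v in value.items()}
    if isinstance(value, list):
        if not value or depth >= max_depth:
            return f"array[{len(value)}]"
        # Assumes list items are homogeneous, so only the first is sampled.
        return {"array_len": len(value),
                "item": describe_schema(value[0], depth + 1, max_depth)}
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    return "string"


def summarize_tool_output(status_code: int, body: str) -> dict[str, Any]:
    """Reduce a raw response to the facts a judge needs: success and shape."""
    summary: dict[str, Any] = {
        "ok": 200 <= status_code < 300,
        "status_code": status_code,
        "body_chars": len(body),
    }
    try:
        summary["schema"] = describe_schema(json.loads(body))
    except json.JSONDecodeError:
        # Not JSON: fall back to a short prefix of the raw text.
        summary["preview"] = body[:MAX_PREVIEW_CHARS]
    return summary


def summarize_file_change(before: str, after: str) -> dict[str, Any]:
    """Report only the lines that changed between two file versions."""
    diff = list(difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm="", n=0))
    hunks = diff[2:]  # skip the '---' and '+++' header lines
    added = [line[1:] for line in hunks if line.startswith("+")]
    removed = [line[1:] for line in hunks if line.startswith("-")]
    return {
        "changed": before != after,
        "added_count": len(added),
        "removed_count": len(removed),
        "added": added[:MAX_DIFF_LINES],
        "removed": removed[:MAX_DIFF_LINES],
    }
```

The judge prompt would then receive `json.dumps(summarize_tool_output(...))` instead of the raw response. Sampling only the first list item keeps the summary small but can hide heterogeneous payloads, so the depth and sampling limits should be tuned to the tools under evaluation.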
Lifecycle
2026-06-16T23:08:32.478110+00:00— report_created — created