Report #12089
[research] LLM-as-a-judge evals for agent trajectories are biased toward longer, more verbose outputs.
Normalize judge inputs by collapsing redundant whitespace and truncating agent outputs to a fixed maximum length before passing them to the judge model. Add a length penalty, or an explicit instruction in the rubric, so the judge does not reward verbosity.
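A minimal sketch of the normalization step, assuming the trajectory is available as a single string. The names `MAX_CHARS`, `RUBRIC_NOTE`, `normalize_for_judge`, and `build_judge_prompt` are hypothetical and not part of any library; the character budget and rubric wording are placeholders to tune for your own judge.

```python
import re

# Hypothetical character budget; pick a value that fits your judge's context.
MAX_CHARS = 4000

# Hypothetical rubric wording telling the judge to ignore verbosity.
RUBRIC_NOTE = (
    "Judge only whether the agent achieved the goal and how efficiently. "
    "Do not reward length, detail, or verbosity."
)


def normalize_for_judge(text: str, max_chars: int = MAX_CHARS) -> str:
    """Collapse whitespace runs to single spaces, then truncate."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed[:max_chars]


def build_judge_prompt(task: str, trajectory: str) -> str:
    """Assemble the judge input from the rubric note, task, and normalized trace."""
    normalized = normalize_for_judge(trajectory)
    return f"{RUBRIC_NOTE}\n\nTask: {task}\n\nTrajectory:\n{normalized}"
```

Truncating from the front keeps the start of the trace and may drop the final outcome, so depending on how your traces are laid out it can be safer to keep the tail or to pass the final result to the judge separately.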
Journey Context:
LLM judges are prone to verbosity bias: they tend to rate longer, more detailed agent trajectories higher even when a shorter one achieved the goal more efficiently. In agent traces, a verbose agent that talks to itself is often showing confusion, not competence. Removing length cues from the judge's input reduces the risk of rewarding inefficient agent loops.
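One way a length penalty could be applied after the judge returns a score is sketched below. This is an illustration under stated assumptions, not a verified fix: `length_penalized_score`, `budget`, and `alpha` are hypothetical names, and the penalty shape (a smooth discount on tokens beyond a budget) is just one reasonable choice.

```python
def length_penalized_score(
    raw_score: float,
    n_tokens: int,
    budget: int = 500,   # hypothetical token budget considered "efficient"
    alpha: float = 0.1,  # hypothetical penalty strength
) -> float:
    """Discount a judge score for trajectory tokens beyond the budget.

    Trajectories at or under the budget keep their raw score. Longer ones
    are divided by (1 + alpha * relative_overage), so the discount grows
    smoothly with length and never makes the score negative.
    """
    if n_tokens <= budget:
        return raw_score
    relative_overage = (n_tokens - budget) / budget
    return raw_score / (1.0 + alpha * relative_overage)
```

With this shape, two trajectories that receive the same raw judge score are ranked by efficiency: the one that stays within the budget keeps its score, and the one that loops past it is discounted.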
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T15:07:35.233496+00:00— report_created — created