Report #31510
[research] LLM-as-a-judge evals incorrectly favor verbose agent outputs over concise, correct ones
Calibrate the LLM judge with few-shot examples that contrast concise-correct outputs against verbose-incorrect ones, and state explicitly in the rubric prompt that length is not a quality signal: penalize padding or reward brevity.
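A minimal sketch of such a rubric prompt, assuming a PASS/FAIL judge that receives one plain-text prompt. The rubric wording, the `Example` type, `build_judge_prompt`, and the two calibration examples are all hypothetical illustrations, not taken from any specific eval framework. In practice the examples would come from human-rated data.

```python
from dataclasses import dataclass


@dataclass
class Example:
    task: str
    output: str
    verdict: str  # "PASS" or "FAIL", as assigned by a human rater
    reason: str


# Hypothetical rubric text: tells the judge that length is not a quality signal.
RUBRIC = (
    "You are grading an agent's output for correctness.\n"
    "Rules:\n"
    "1. Judge only whether the output correctly completes the task.\n"
    "2. Length is not evidence of quality. Do not reward extra explanation.\n"
    "3. If two outputs are equally correct, prefer the shorter one.\n"
    "4. Reply with PASS or FAIL on the first line, then one sentence of reasoning."
)

# Hypothetical few-shot pair: one concise-correct, one verbose-incorrect.
FEW_SHOT = [
    Example(
        task="What is the capital of France?",
        output="Paris.",
        verdict="PASS",
        reason="Correct and complete; brevity is fine.",
    ),
    Example(
        task="What is the capital of France?",
        output=(
            "Great question! France has a long and rich history, and its "
            "cities are famous worldwide. The capital of France is Lyon."
        ),
        verdict="FAIL",
        reason="The answer is wrong; the added detail does not compensate.",
    ),
]


def build_judge_prompt(task: str, candidate: str, examples: list[Example]) -> str:
    """Assemble rubric, calibration examples, and the item to grade."""
    parts = [RUBRIC, "Calibration examples:"]
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\n"
            f"Task: {ex.task}\n"
            f"Output: {ex.output}\n"
            f"Verdict: {ex.verdict}\n"
            f"Reason: {ex.reason}"
        )
    parts.append(f"Now grade this output.\nTask: {task}\nOutput: {candidate}\nVerdict:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_judge_prompt("What is 2 + 2?", "4", FEW_SHOT))
```

The resulting string would be sent to whichever judge model is in use; that call is left out here because it depends on the provider.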
Journey Context:
A known bias in LLM evaluators is verbosity bias: they rate longer outputs as higher quality even when the extra length is filler. In agent evaluation, concise tool calls and summaries are often the better output. Without explicit anti-verbosity constraints and few-shot calibration against a human-rated golden set, the judge will systematically pass degraded, chatty agents that waste downstream tokens.
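One way to check the judge against a human-rated golden set is to compare verdicts item by item and look at the length of the outputs where the judge disagrees with the human. The sketch below assumes each golden item already carries a human PASS/FAIL label and the judge's verdict. `GoldenItem`, `calibration_report`, and the whitespace-based word count are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class GoldenItem:
    output: str       # the agent output that was graded
    human_pass: bool  # label from the human-rated golden set
    judge_pass: bool  # verdict from the LLM judge


def word_count(text: str) -> int:
    # Crude length proxy (assumption); a tokenizer count would also work.
    return len(text.split())


def calibration_report(items: list[GoldenItem]) -> dict:
    """Summarize judge/human agreement and the length of each error type."""
    if not items:
        raise ValueError("golden set is empty")

    # Judge passed something the human failed.
    false_passes = [it for it in items if it.judge_pass and not it.human_pass]
    # Judge failed something the human passed.
    false_fails = [it for it in items if it.human_pass and not it.judge_pass]

    def mean_words(group: list[GoldenItem]):
        return mean(word_count(it.output) for it in group) if group else None

    return {
        "agreement": mean(
            1.0 if it.human_pass == it.judge_pass else 0.0 for it in items
        ),
        "false_pass_count": len(false_passes),
        "false_fail_count": len(false_fails),
        "mean_words_false_pass": mean_words(false_passes),
        "mean_words_false_fail": mean_words(false_fails),
    }
```

If the false passes are noticeably longer than the false fails, that is consistent with verbosity bias, and the rubric or few-shot examples likely need tightening before the judge is trusted.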
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T07:16:31.246135+00:00 - report_created - created