Report #7743
[research] LLM-as-a-judge incorrectly rates inefficient, long agent traces as better than concise ones
Rewrite judge prompts to penalize redundant verbosity and explicitly reward efficiency. Include step count or token usage in the judge's evaluation context, or run a separate efficiency eval alongside the outcome eval.
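A minimal sketch of the first option, assuming the trace is available as plain text and that step and token counts have already been collected. The names `TraceStats`, `JUDGE_TEMPLATE`, and `build_judge_prompt`, and the prompt wording itself, are hypothetical and not taken from any particular eval framework.

```python
from dataclasses import dataclass


@dataclass
class TraceStats:
    """Hypothetical container for per-trace cost metrics."""
    steps: int         # number of agent steps (tool calls or turns)
    total_tokens: int  # tokens consumed across the whole trace


# Illustrative wording only; tune it against your own judge model.
JUDGE_TEMPLATE = """You are grading an agent trace.
Score the outcome from 1 to 5 based on correctness only.
Do not reward length or extra detail; redundant steps should lower the score.
When two traces reach the same result, prefer the one with fewer steps and tokens.

Steps taken: {steps}
Tokens used: {total_tokens}

Trace:
{trace}
"""


def build_judge_prompt(trace: str, stats: TraceStats) -> str:
    # Put the cost metrics into the judge's context so it does not have to
    # infer efficiency from how long the trace looks.
    return JUDGE_TEMPLATE.format(
        steps=stats.steps,
        total_tokens=stats.total_tokens,
        trace=trace,
    )
```

Passing the counts explicitly is a design choice: the judge sees a number instead of estimating effort from the length of the text, which is where the bias comes from.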
Journey Context:
LLM judges have a known verbosity bias: they tend to score longer, more detailed responses higher, even when the extra detail is redundant. In agentic traces, a 10-step trace that brute-forces a solution can score higher than an elegant 2-step trace that reaches the same result. Either instruct the judge explicitly to value efficiency, or decouple efficiency scoring from the judge entirely.
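The decoupled option can be sketched as a deterministic efficiency score computed outside the judge and reported next to the judge's outcome score. This is an illustration under assumptions: `efficiency_score`, `evaluate`, and the `baseline_steps / steps` formula are hypothetical choices, and the baseline step count has to come from your own reference solutions.

```python
def efficiency_score(steps: int, baseline_steps: int) -> float:
    """Return 1.0 at or below the baseline, falling off as baseline/steps above it."""
    if steps <= 0 or baseline_steps <= 0:
        raise ValueError("step counts must be positive")
    return min(1.0, baseline_steps / steps)


def evaluate(outcome_score: float, steps: int, baseline_steps: int) -> dict:
    # Report the two numbers separately so the judge's verbosity bias
    # cannot leak into the efficiency measurement.
    return {
        "outcome": outcome_score,
        "efficiency": efficiency_score(steps, baseline_steps),
    }


# The 10-step vs 2-step case from the report, with assumed outcome scores.
concise = evaluate(outcome_score=0.9, steps=2, baseline_steps=2)
brute_force = evaluate(outcome_score=1.0, steps=10, baseline_steps=2)
print(concise)      # efficiency is 1.0
print(brute_force)  # efficiency is 0.2
```

Keeping the two scores in separate fields leaves the weighting decision to whoever reads the results, rather than baking it into a single blended number.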
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T03:39:25.843805+00:00— report_created — created