Report #92784
[research] LLM-as-a-judge evals are inconsistent and biased towards verbose outputs
Enforce a strict, multi-point rubric with chained reasoning for LLM judges. Require the judge model to output a pass/fail verdict for each specific criterion (e.g., 'Did it use the ID from the prompt?', 'Is the tone formal?') before it gives an overall score. Use a smaller, faster model as the judge to reduce cost and verbosity bias.
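A minimal sketch of the per-criterion rubric, assuming the judge is asked to reply in JSON. The names `Criterion`, `RUBRIC`, `build_judge_prompt`, and `score_verdicts` are hypothetical, and the call to the judge model is deliberately left out because it depends on your provider. Here the overall score is computed in code from the verdicts rather than requested from the judge, which is one possible design choice and not something the report prescribes.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    key: str       # short identifier used in the judge's JSON reply
    question: str  # yes/no question the judge must answer


# Example criteria taken from the report; extend with your own.
RUBRIC = [
    Criterion("uses_prompt_id", "Did it use the ID from the prompt?"),
    Criterion("formal_tone", "Is the tone formal?"),
]


def build_judge_prompt(task_prompt, output, rubric):
    """Build a prompt that asks for reasoning plus a verdict per criterion."""
    lines = [
        "You are grading a model output against a fixed rubric.",
        "For each criterion, give one sentence of reasoning, then a verdict.",
        "Do not reward length; judge only the criteria listed.",
        "",
        "Criteria:",
    ]
    lines += [f"- {c.key}: {c.question}" for c in rubric]
    lines += [
        "",
        'Reply with JSON only, shaped like {"<key>": {"reasoning": "...", '
        '"verdict": "pass"}} where verdict is "pass" or "fail".',
        "",
        "Task prompt:",
        task_prompt,
        "",
        "Output to grade:",
        output,
    ]
    return "\n".join(lines)


def score_verdicts(raw_reply, rubric):
    """Parse the judge's JSON reply into booleans and a pass fraction."""
    data = json.loads(raw_reply)
    verdicts = {}
    for c in rubric:
        entry = data.get(c.key) or {}
        verdict = str(entry.get("verdict", "")).strip().lower()
        if verdict not in ("pass", "fail"):
            raise ValueError(f"missing or invalid verdict for {c.key!r}")
        verdicts[c.key] = verdict == "pass"
    # Overall score derived from the discrete verdicts, not from a free-form rating.
    overall = sum(verdicts.values()) / len(rubric)
    return verdicts, overall
```

Send the string from `build_judge_prompt` to whichever judge model you use, then pass its raw reply to `score_verdicts`. Rejecting replies with a missing or malformed verdict keeps silent parsing failures out of regression results.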
Journey Context:
Using a single prompt like 'Rate this output 1-5' leads to judges that agree with anything (sycophancy) or favor long outputs. Forcing the judge to evaluate discrete constraints first improves inter-rater reliability, because each verdict is tied to a specific, checkable question instead of an overall impression. The tradeoff is more tokens and higher latency for the eval itself, which is worth paying for reliable regression testing. A smaller judge model (e.g., GPT-4o-mini) is reported to show less verbosity bias than larger models, though this report includes no measurements to back that up.
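One way to check these claims on your own data is to run the judge several times on the same output and to compare scores for concise versus padded versions of the same content. The sketch below assumes verdicts are already parsed into booleans and scores into floats; `per_criterion_agreement` and `mean_length_gap` are hypothetical helpers. The agreement figure is a crude self-consistency proxy, not a formal inter-rater statistic such as Cohen's kappa.

```python
from collections import Counter


def per_criterion_agreement(runs):
    """Fraction of repeated judge runs that match the majority verdict.

    runs: list of dicts mapping criterion key -> bool, one dict per run
    on the same output. 1.0 means the judge never changed its mind.
    """
    if not runs:
        raise ValueError("need at least one run")
    agreement = {}
    for key in runs[0]:
        counts = Counter(run[key] for run in runs)
        majority_count = counts.most_common(1)[0][1]
        agreement[key] = majority_count / len(runs)
    return agreement


def mean_length_gap(pairs):
    """Average score difference (padded minus concise).

    pairs: list of (concise_score, padded_score) for outputs whose content
    is the same and only the length differs. A value well above 0 suggests
    the judge rewards verbosity.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(padded - concise for concise, padded in pairs) / len(pairs)
```

Running both checks with the single-prompt judge and the rubric judge, and again with a smaller and a larger judge model, gives a direct comparison for your own eval set.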
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:19:33.140828+00:00— report_created — created