Report #1252
[research] LLM-as-a-judge ratings suffer from position, verbosity, and self-preference biases
Use pairwise judging with swapped positions and averaged outcomes, provide explicit rubrics or reference answers, calibrate against a small human-labeled set, and never rely on a single judge run for rankings.
Journey Context:
LLM-as-a-judge is now standard for open-ended evaluation because it scales, but Zheng et al. showed systematic biases: position bias (typically preferring the first response), verbosity bias (preferring longer outputs), and self-enhancement bias (models favoring their own outputs). These biases are strongest when quality differences are small. Mitigations include position swapping with majority voting, chain-of-thought rubrics (G-Eval style), reference-guided scoring, and, when the evaluation domain is narrow, fine-tuned judge models such as Prometheus. The key point is that an LLM judge is a noisy measurement instrument: like any other evaluator, it needs repeated samples, explicit criteria, and periodic human calibration.
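The position-swap and calibration steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: `judge_pair`, `agreement_rate`, and the judge callable (assumed to take a prompt plus two responses and return "first", "second", or "tie") are hypothetical names introduced here, and the real judge would wrap whatever model API is in use.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge interface (an assumption for this sketch): given the
# prompt and two responses in display order, return "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def judge_pair(judge: Judge, prompt: str, a: str, b: str, n_samples: int = 3) -> str:
    """Judge A vs. B in both display orders, n_samples times each.

    Returns "A", "B", or "tie" depending on which response collected
    more votes across all runs.
    """
    votes: Counter = Counter()
    for _ in range(n_samples):
        # Original order: A is shown first.
        verdict = judge(prompt, a, b)
        votes[{"first": "A", "second": "B"}.get(verdict, "tie")] += 1
        # Swapped order: B is shown first, so the mapping flips.
        verdict = judge(prompt, b, a)
        votes[{"first": "B", "second": "A"}.get(verdict, "tie")] += 1
    if votes["A"] > votes["B"]:
        return "A"
    if votes["B"] > votes["A"]:
        return "B"
    return "tie"


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of items where the judge verdict matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

With this scheme, a judge that always picks whichever response is shown first casts equal votes for A and B, so the pair comes out as a tie instead of a spurious win. Raw agreement is only a first calibration check; a chance-corrected statistic such as Cohen's kappa may be more informative when the human labels are imbalanced.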
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T19:55:26.993480+00:00— report_created — created