Report #2662
[research] LLM-as-a-judge evaluations are systematically biased by answer order, response length, self-preference, and prompt framing, producing unreliable preference rankings.
Mitigate judge bias with position-swapping, length-controlled prompts, rubric refinement, and judge calibration against human labels; use multiple judges and aggregate only when inter-rater agreement is high.
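The "aggregate only when inter-rater agreement is high" rule can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the raw pairwise-agreement measure, and the 0.8 threshold are all assumptions chosen for the example, and a chance-corrected statistic may be preferable in practice.

```python
from collections import Counter
from itertools import combinations


def pairwise_agreement(verdicts_by_judge):
    """Mean fraction of items on which each pair of judges gives the same verdict.

    verdicts_by_judge maps a (hypothetical) judge name to its list of verdicts,
    one per evaluated item, all lists in the same item order.
    """
    columns = list(verdicts_by_judge.values())
    if len(columns) < 2:
        raise ValueError("need at least two judges")
    n_items = len(columns[0])
    if n_items == 0 or any(len(c) != n_items for c in columns):
        raise ValueError("every judge must label the same non-empty item set")
    rates = [
        sum(x == y for x, y in zip(a, b)) / n_items
        for a, b in combinations(columns, 2)
    ]
    return sum(rates) / len(rates)


def aggregate_if_reliable(verdicts_by_judge, min_agreement=0.8):
    """Majority verdict per item, or None when the judges disagree too often.

    min_agreement=0.8 is an illustrative threshold, not a recommended value.
    Ties in the vote resolve to the verdict of the earliest-listed judge.
    """
    if pairwise_agreement(verdicts_by_judge) < min_agreement:
        return None  # signal that these items need human review instead
    per_item = zip(*verdicts_by_judge.values())
    return [Counter(item).most_common(1)[0][0] for item in per_item]


verdicts = {
    "judge_1": ["A", "B", "A", "A"],
    "judge_2": ["A", "B", "A", "B"],
    "judge_3": ["A", "B", "A", "A"],
}
print(aggregate_if_reliable(verdicts))
```

Returning None rather than a low-confidence ranking keeps unreliable aggregates out of downstream reports.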
Journey Context:
The LMSYS paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" showed that strong LLM judges can reach over 80% agreement with human preferences on open-ended chat, but it also documented position bias (favoring the first or last answer), verbosity bias (favoring longer answers), and self-enhancement bias (preferring the judge's own outputs). Judging each pair in both orders and accepting only consistent wins is a conservative fix; randomizing order helps at scale. Few-shot judge prompts improve consistency under order swaps, and reference-guided judging improves math and reasoning grading, while chain-of-thought alone can leave the judge repeating errors from the answers it is grading. Style and sycophancy biases are harder to remove, so LLM judges should be calibrated on a human-labeled subset and their kappa or accuracy reported. The practical pattern is: use LLM judges for preference and subjective quality, do not rely on them alone for objective correctness, and always disclose the judge model and prompt.
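The position-swapping fix and the calibration step described above can be sketched in a few lines. This is a hedged illustration under stated assumptions: `judge` is a hypothetical callable that returns "A" when it prefers the first-shown answer, "B" for the second, or "tie"; the report does not say which kappa variant is meant, so Cohen's kappa is assumed here.

```python
from collections import Counter


def judge_with_swap(judge, question, answer_1, answer_2):
    """Query the judge in both orders and keep only consistent wins.

    Returns "A" if answer_1 wins in both orders, "B" if answer_2 wins in both,
    and "tie" otherwise (including when the verdict flips with position).
    """
    first = judge(question, answer_1, answer_2)
    second = judge(question, answer_2, answer_1)
    # Map the swapped-order verdict back into the original labelling.
    second_unswapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == second_unswapped else "tie"


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences of equal length."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and equally long")
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy stand-ins for a real judge model, used only to exercise the logic.
def always_first(question, a, b):
    return "A"  # a purely position-biased judge


def prefers_longer(question, a, b):
    return "A" if len(a) > len(b) else "B"  # a purely verbosity-biased judge


print(judge_with_swap(always_first, "q", "long answer", "short"))
print(judge_with_swap(prefers_longer, "q", "long answer", "short"))
print(cohen_kappa(["A", "A", "B", "B"], ["A", "A", "B", "A"]))
```

Note that the swap neutralizes the position-biased judge (its verdict collapses to a tie) but not the verbosity-biased one, which wins consistently in both orders; length bias needs a separate control.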
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T13:32:49.594013+00:00— report_created — created