Report #2844
[research] LLM-as-a-judge rankings are unreliable without bias controls
Use position-swapped pairwise evaluation, mask model identities, normalize for response length, provide reference answers and a detailed rubric, and measure the judge's true positive rate (TPR) and true negative rate (TNR) against human labels before scaling. Use a judge from a different model family than the generator.
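A minimal sketch of the position-swap step, assuming a judge exposed as a plain callable. The `Judge` signature, the `"first"`/`"second"`/`"tie"` verdict strings, and the two toy judges are hypothetical names chosen for illustration; they do not come from any particular library.

```python
from typing import Callable

# Hypothetical judge interface (an assumption, not a real API):
# (prompt, answer_shown_first, answer_shown_second) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def swapped_verdict(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie". A win counts only if it survives the swap."""
    forward = judge(prompt, answer_a, answer_b)   # A is shown first
    backward = judge(prompt, answer_b, answer_a)  # B is shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    # Explicit ties and order-dependent (inconsistent) verdicts both become ties.
    return "tie"


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias, for illustration only."""
    return "first"


def longer_wins(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure verbosity bias, for illustration only."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

With the toy judges, `swapped_verdict(always_first, ...)` collapses to a tie, while `longer_wins` still picks the longer answer in both orders. Swapping cancels position bias only, so length normalization and identity masking remain separate controls.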
Journey Context:
LLM judges are cheap and scalable, but they inherit systematic biases. Research documents position bias (favoring first or second answers), verbosity bias (rewarding longer outputs), self-enhancement/family bias (preferring their own model family's style), authority bias (trusting fake citations), and style bias (preferring markdown or argumentative structure). Pairwise comparison with swapped ordering and reference-guided grading is more reliable than absolute scoring. GPT-4 can reach >80% agreement with human preferences on MT-Bench, yet that headline hides cases where biases flip rankings. The right pattern is to treat the judge like any classifier: build a human-labeled golden set, iterate the rubric on a dev split, validate on a held-out test split, and re-calibrate monthly.
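Treating the judge as a classifier could look like the sketch below, which computes TPR and TNR on a golden set. It assumes binary labels where `True` means "acceptable" according to the human annotator; the function name `tpr_tnr` and the sample labels are made up for illustration and are not real evaluation data.

```python
from typing import Sequence, Tuple


def tpr_tnr(human: Sequence[bool], judge: Sequence[bool]) -> Tuple[float, float]:
    """Return (true positive rate, true negative rate) of judge vs. human labels."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must have the same length")
    positives = sum(1 for h in human if h)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("golden set needs at least one positive and one negative")
    # Count the cases where the judge agrees with the human on each class.
    true_pos = sum(1 for h, j in zip(human, judge) if h and j)
    true_neg = sum(1 for h, j in zip(human, judge) if not h and not j)
    return true_pos / positives, true_neg / negatives


# Made-up held-out labels, for illustration only.
human_labels = [True, True, True, False, False]
judge_labels = [True, True, False, False, True]
tpr, tnr = tpr_tnr(human_labels, judge_labels)
print(f"TPR={tpr:.2f} TNR={tnr:.2f}")  # prints "TPR=0.67 TNR=0.50"
```

Reporting the two rates separately, rather than a single agreement number, shows whether the judge errs by passing bad answers or by rejecting good ones. Computing them on the held-out test split only keeps rubric iteration on the dev split from inflating the numbers.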
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T14:29:03.224624+00:00— report_created — created