Report #555
[research] LLM-as-a-judge evaluations are biased by answer order, response length, and self-inconsistency, producing flaky rankings
Use pairwise comparisons with randomized order, measure and report position bias and length bias, decompose each criterion into a separate judge call, run multiple samples to estimate flipping noise, and calibrate the judge against a small human-labeled gold set before scaling.
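The pairwise comparison and position-bias measurement recommended above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `judge` callable and its "first"/"second" return convention are assumptions made for the example, and `randomized_verdict` and `position_bias_report` are hypothetical helper names.

```python
import random
from typing import Callable, Iterable

# Hypothetical judge interface (an assumption, not a real library API):
# judge(prompt, first, second) returns "first" or "second".
Judge = Callable[[str, str, str], str]


def randomized_verdict(judge: Judge, prompt: str, a: str, b: str,
                       rng: random.Random) -> str:
    """Single judge call with a coin-flipped order; returns 'A' or 'B'."""
    if rng.random() < 0.5:
        return "A" if judge(prompt, a, b) == "first" else "B"
    return "B" if judge(prompt, b, a) == "first" else "A"


def position_bias_report(judge: Judge,
                         items: Iterable[tuple[str, str, str]]) -> dict[str, float]:
    """Judge every (prompt, a, b) item in both orders and summarize."""
    n = consistent = first_slot_wins = 0
    for prompt, a, b in items:
        v_ab = "A" if judge(prompt, a, b) == "first" else "B"  # A shown first
        v_ba = "B" if judge(prompt, b, a) == "first" else "A"  # B shown first
        n += 1
        consistent += v_ab == v_ba  # same winner regardless of order
        first_slot_wins += (v_ab == "A") + (v_ba == "B")
    if n == 0:
        raise ValueError("no items to evaluate")
    return {
        "consistency": consistent / n,
        "first_slot_rate": first_slot_wins / (2 * n),
    }
```

Under these assumptions, a `first_slot_rate` well above 0.5 suggests a first-position preference and a value well below 0.5 suggests a last-position preference, while `consistency` is the share of pairs whose winner survives the order swap. Judging both orders doubles the cost; `randomized_verdict` is the cheaper single-call variant.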
Journey Context:
LLM judges are cheaper and more consistent than humans but inherit model biases: they favor answers placed first or last, prefer longer responses, and can flip verdicts on identical inputs. Research formalizes these as position bias, length bias, and flipping noise. A common mistake is a single zero-shot rating call that packs multiple criteria into one prompt, which conflates dimensions and amplifies noise. Alternatives include fine-tuned reward models (deterministic but narrow) and human evaluation (expensive). The practical pattern is one criterion per judge call, structured JSON outputs, randomized pairwise comparisons, and explicit bias metrics rather than headline agreement with humans alone.
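The remaining metrics named above (flipping noise, length bias, and agreement with a human gold set) could be computed along these lines. This is a sketch under stated assumptions: the function names are hypothetical, verdicts are assumed to be the labels "A" or "B", and Cohen's kappa is used here as one possible chance-corrected alternative to raw agreement, not something the report prescribes.

```python
from collections import Counter
from typing import Sequence


def flip_rate(verdicts: Sequence[str]) -> float:
    """Fraction of repeated samples that disagree with the majority verdict."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return 1 - majority_count / len(verdicts)


def longer_wins_rate(results: Sequence[tuple[str, str, str]]) -> float:
    """Share of comparisons won by the longer response.

    Each result is (response_a, response_b, winner) with winner 'A' or 'B'.
    Pairs of equal length are skipped.
    """
    decided = longer_won = 0
    for a, b, winner in results:
        if len(a) == len(b):
            continue
        decided += 1
        longer = "A" if len(a) > len(b) else "B"
        longer_won += winner == longer
    if decided == 0:
        raise ValueError("no comparisons with differing lengths")
    return longer_won / decided


def cohen_kappa(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Chance-corrected agreement between judge and human labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq, human_freq = Counter(judge_labels), Counter(human_labels)
    expected = sum(judge_freq[k] * human_freq[k] for k in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both sides used a single label (a convention)
    return (observed - expected) / (1 - expected)
```

A high `longer_wins_rate` is not proof of length bias on its own, since longer answers may genuinely be better; comparing it with the same rate computed from the human gold labels gives a fairer reference point.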
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T09:53:24.245988+00:00— report_created — created