Report #677
[research] Rubric-based LLM-as-a-judge inherits position bias because score ordering acts like a multiple-choice prompt
Use balanced permutations of the rubric ordering: present each score option in each position equally often (e.g., 5 forward + 5 reverse cyclic rotations) and aggregate the resulting scores. For pairwise judging, swap the candidate order and average the two verdicts. Always validate against human annotations, and version the rubric together with the prompt.
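A minimal Python sketch of the balanced-rotation idea, assuming a 5-point rubric. The names `balanced_orderings`, `aggregate_score`, and the `judge` callable are hypothetical and not taken from Xu et al.; `judge` stands in for an LLM call that receives the score options in presentation order and returns the score it picked.

```python
from statistics import mean
from typing import Callable, Sequence

Ordering = tuple[int, ...]


def balanced_orderings(scores: Sequence[int]) -> list[Ordering]:
    """Return all cyclic rotations of the forward and the reversed score list.

    For n options this yields 2 * n orderings in which every score
    occupies every position exactly twice.
    """
    forward = tuple(scores)
    backward = tuple(reversed(scores))
    orderings: list[Ordering] = []
    for base in (forward, backward):
        for shift in range(len(base)):
            orderings.append(base[shift:] + base[:shift])
    return orderings


def aggregate_score(judge: Callable[[Ordering], int], scores: Sequence[int]) -> float:
    """Query the (hypothetical) judge once per ordering and average the picks."""
    return mean(judge(ordering) for ordering in balanced_orderings(scores))


def always_first(ordering: Ordering) -> int:
    """Toy judge with pure primacy bias: it picks whatever is listed first."""
    return ordering[0]
```

With the toy `always_first` judge, a single forward ordering returns 1 and a single reversed ordering returns 5, while `aggregate_score(always_first, [1, 2, 3, 4, 5])` averages to the midpoint 3. The spread across orderings is what exposes the position effect. A real judge would also depend on the response being scored, which this sketch omits.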
Journey Context:
Rubric scoring feels point-wise, but Xu et al. showed it is effectively multiple-choice: LLMs consistently prefer scores at the beginning or end of the rubric list (primacy/recency), and smaller models show stronger bias. Averaging over balanced permutations both reveals the bias and improves Spearman/Pearson correlation with human judgments. Pairwise judging has the same family of problems (position, verbosity, self-preference), so swapping candidate order is already established practice there. The actionable takeaway is that a single scalar from one rubric ordering is not enough: permute, aggregate, and calibrate against humans before trusting the judge for optimization.
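For the pairwise case, order swapping can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `judge_pair` is a hypothetical callable returning the judge's preference (between 0 and 1) for whichever candidate is shown first, and the toy judge models position bias as a fixed additive bonus for the first slot.

```python
from typing import Callable

# Hypothetical interface: judge_pair(first, second) -> preference for `first` in [0, 1].
PairJudge = Callable[[str, str], float]


def swap_averaged_preference(judge_pair: PairJudge, a: str, b: str) -> float:
    """Estimate the preference for `a` over `b` by judging both presentation orders."""
    pref_a_shown_first = judge_pair(a, b)   # a occupies the first slot
    pref_b_shown_first = judge_pair(b, a)   # b occupies the first slot
    # Convert the second verdict into a preference for `a`, then average.
    return (pref_a_shown_first + (1.0 - pref_b_shown_first)) / 2.0


# Toy setup (assumed values): the unbiased preference for A over B is 0.6.
TRUE_PREF = {("A", "B"): 0.6, ("B", "A"): 0.4}


def first_slot_biased_judge(first: str, second: str) -> float:
    """Toy judge that adds a 0.2 bonus to whichever candidate is shown first."""
    return min(1.0, TRUE_PREF[(first, second)] + 0.2)
```

In this toy setup a single call with A first overstates the preference for A, while the swap-averaged estimate recovers roughly 0.6. The cancellation is exact only because the assumed bias is additive and symmetric. Real judges may not behave that way, so checking against human annotations remains necessary.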
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T11:52:36.519554+00:00— report_created — created