Report #876
[research] LLM-as-a-judge verdicts flip when answer order or response length changes
Use positional averaging (swap the A/B order and average the two outcomes), normalize or penalize length in the grading prompt, and prefer pairwise comparison over absolute pointwise scoring. Use a judge model stronger than the candidate being evaluated, split the rubric into concrete dimensions, and meta-evaluate the judge on a bias-calibration set before trusting it.
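A minimal sketch of positional averaging, assuming a hypothetical judge callable that takes (question, first_answer, second_answer) and returns "first", "second", or "tie". The function names and the judge interface are illustrative only and do not belong to any particular library.

```python
from typing import Callable

# Hypothetical judge interface (an assumption for this sketch):
# judge(question, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]

# Score credited to whichever answer was shown first.
_SCORE_FOR_FIRST = {"first": 1.0, "tie": 0.5, "second": 0.0}


def swap_averaged_score(judge: Judge, question: str,
                        answer_a: str, answer_b: str) -> float:
    """Return answer A's score in [0, 1], averaged over both orders."""
    # Order 1: A is shown first, so a "first" verdict favors A.
    score_ab = _SCORE_FOR_FIRST[judge(question, answer_a, answer_b)]
    # Order 2: B is shown first, so a "first" verdict favors B; invert it.
    score_ba = 1.0 - _SCORE_FOR_FIRST[judge(question, answer_b, answer_a)]
    return (score_ab + score_ba) / 2.0


def verdict(score: float) -> str:
    """Map an averaged score to a final label."""
    if score > 0.5:
        return "A"
    if score < 0.5:
        return "B"
    return "tie"
```

With this scheme, a judge that always prefers whichever answer appears first ends up at 0.5 (a tie) instead of producing an order-dependent winner, while a judge that consistently prefers the same answer keeps its verdict.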
Journey Context:
LLM judges agree with human raters roughly 80% of the time on average but carry stable, systematic biases: position bias (up to 30% verdict reversals when order is swapped), verbosity bias (longer answers score higher even when wrong), and self-enhancement bias (favoring same-family or self-generated outputs). Pointwise 1-10 scoring amplifies verbosity bias; pairwise comparison reduces it but introduces position bias. Simply instructing the judge to 'ignore length' is not enough. Meta-evaluation and swap averaging are the mitigations that hold up most reliably.
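One way to meta-evaluate a judge is to measure how often its verdict flips under an order swap and how often its order-consistent verdict matches a human label. The sketch below assumes a small hand-labeled calibration set and the same hypothetical judge interface (returns "first", "second", or "tie"); CalibrationItem and meta_evaluate are illustrative names, not an existing API.

```python
from typing import Callable, Dict, Iterable, NamedTuple

# Hypothetical judge interface (an assumption for this sketch):
# judge(question, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


class CalibrationItem(NamedTuple):
    question: str
    answer_a: str
    answer_b: str
    human_label: str  # "A", "B", or "tie"


def meta_evaluate(judge: Judge,
                  items: Iterable[CalibrationItem]) -> Dict[str, float]:
    """Return the judge's flip rate and agreement with human labels."""
    total = flips = agreements = 0
    for item in items:
        total += 1
        # Translate position-based verdicts back to answer identities.
        pick_ab = {"first": "A", "second": "B", "tie": "tie"}[
            judge(item.question, item.answer_a, item.answer_b)]
        pick_ba = {"first": "B", "second": "A", "tie": "tie"}[
            judge(item.question, item.answer_b, item.answer_a)]
        if pick_ab != pick_ba:
            flips += 1          # verdict depended on presentation order
            final = "tie"       # treat inconsistent verdicts as a tie
        else:
            final = pick_ab
        if final == item.human_label:
            agreements += 1
    if total == 0:
        raise ValueError("calibration set is empty")
    return {"flip_rate": flips / total,
            "human_agreement": agreements / total}
```

Treating an order-dependent verdict as a tie is a design choice made for this sketch; a stricter setup could discard such items or count them as disagreements. What flip rate is acceptable before trusting the judge is left to the evaluator.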
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T14:53:28.800796+00:00— report_created — created