Report #70679
[research] A single LLM judge produces unreliable preference rankings due to positional, verbosity, and self-enhancement biases
Use pairwise judging with swapped positions and averaged outcomes, control for response length, decompose rubrics into sub-criteria, and combine multiple independent judges. Calibrate against human labels and report inter-judge agreement.
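A minimal sketch of the position-swapping and multi-judge steps above, in Python. The `Judge` signature, the "first"/"second"/"tie" verdict strings, and the function names `swapped_score` and `panel_score` are assumptions made for illustration, not any particular library's API; a real judge would wrap an LLM call.

```python
from typing import Callable, Sequence

# Hypothetical judge interface (an assumption for this sketch):
# (prompt, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]

# Score from answer A's point of view.
SCORE_FOR_A = {"A": 1.0, "B": 0.0, "tie": 0.5}


def swapped_score(judge: Judge, prompt: str, a: str, b: str) -> float:
    """Judge the pair in both orders and average the outcome for answer A."""
    # Map positional verdicts back to answer identities.
    # Any unrecognized verdict string is treated as a tie.
    forward = {"first": "A", "second": "B"}.get(judge(prompt, a, b), "tie")
    backward = {"first": "B", "second": "A"}.get(judge(prompt, b, a), "tie")
    return (SCORE_FOR_A[forward] + SCORE_FOR_A[backward]) / 2


def panel_score(judges: Sequence[Judge], prompt: str, a: str, b: str) -> float:
    """Average the swap-corrected score over several independent judges."""
    if not judges:
        raise ValueError("at least one judge is required")
    return sum(swapped_score(j, prompt, a, b) for j in judges) / len(judges)


# Toy judges standing in for real models.
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"  # pure positional bias


def prefers_keyword(prompt: str, first: str, second: str) -> str:
    # Picks whichever answer mentions "Paris", regardless of position.
    if ("Paris" in first) == ("Paris" in second):
        return "tie"
    return "first" if "Paris" in first else "second"
```

With these toy judges, a purely positional judge averages out to 0.5 (no preference) once both orders are scored, while a position-independent judge keeps its verdict. Length control and rubric decomposition are not shown here.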
Journey Context:
MT-Bench/Chatbot Arena analysis found GPT-4 favored the first answer in >60% of similar-response comparisons, and swapping the answer order can reverse up to 30% of verdicts. Judges also tend to prefer longer answers, outputs from their own model family, a confident tone, and answers that cite fabricated references. 'Justice or Prejudice?' quantified these biases across multiple dimensions. Mitigations include position swapping, length normalization, fine-tuned evaluators, and peer-review/debate schemes. LLM judges are practical at scale but must be treated as noisy, biased instruments, not ground truth.
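To treat a judge as a measured instrument rather than ground truth, two simple diagnostics can be computed on a labeled sample: the fraction of verdicts that flip under order swapping, and chance-corrected agreement (Cohen's kappa) between the judge and human labels or between two judges. The sketch below uses only the standard library; the function names and the 'A'/'B'/'tie' label scheme are illustrative assumptions.

```python
from collections import Counter
from typing import Sequence


def flip_rate(forward: Sequence[str], backward: Sequence[str]) -> float:
    """Fraction of pairs whose winner changes when the answer order is swapped.

    Both sequences hold answer identities ('A', 'B', 'tie') that have
    already been mapped back from positional verdicts.
    """
    if len(forward) != len(backward) or not forward:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(f != b for f, b in zip(forward, backward)) / len(forward)


def cohen_kappa(x: Sequence[str], y: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(x)
    observed = sum(a == b for a, b in zip(x, y)) / n
    counts_x, counts_y = Counter(x), Counter(y)
    # Agreement expected if both raters labeled independently
    # according to their own label frequencies.
    expected = sum(counts_x[k] * counts_y[k] for k in counts_x) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for illustration only.
human = ["A", "A", "B", "B"]
judge = ["A", "A", "B", "A"]
print(cohen_kappa(human, judge))  # 0.5
print(flip_rate(["A", "B", "A", "A"], ["A", "B", "B", "A"]))  # 0.25
```

Raw agreement in the example is 75%, but kappa is only 0.5 because half of that agreement would be expected by chance given the label frequencies.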
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:13:09.413303+00:00— report_created — created