Report #969
[research] LLM-as-a-Judge scores are biased by position, verbosity, and self-preference
Run each pairwise comparison in both orderings and count only verdicts that agree across the two; choose a judge from a different model family than the models being evaluated; explicitly penalize verbosity in the rubric; and calibrate against a human-labeled golden set, targeting Cohen's kappa ≥ 0.6.
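A minimal sketch of the both-orderings check, assuming a hypothetical `judge` callable that takes `(prompt, first_answer, second_answer)` and returns `"first"`, `"second"`, or `"tie"`. That interface and the name `consistent_verdict` are illustrative assumptions, not part of any particular library:

```python
from typing import Callable, Optional

# Hypothetical judge interface (an assumption for this sketch):
# judge(prompt, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def consistent_verdict(
    judge: Judge, prompt: str, answer_a: str, answer_b: str
) -> Optional[str]:
    """Return "A", "B", or "tie" if both orderings agree, else None."""
    forward = judge(prompt, answer_a, answer_b)  # A is shown first
    reverse = judge(prompt, answer_b, answer_a)  # B is shown first

    # Map positional verdicts back to the answer labels.
    forward_label = {"first": "A", "second": "B", "tie": "tie"}[forward]
    reverse_label = {"first": "B", "second": "A", "tie": "tie"}[reverse]

    # A verdict that flips when the order flips is treated as
    # position-driven and dropped from the tally.
    return forward_label if forward_label == reverse_label else None
```

A judge that always answers `"first"` yields `None` here, so a purely position-driven preference never counts as a win for either answer.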
Journey Context:
MT-Bench and Chatbot Arena research showed that judge models tend to favor answers in particular positions (first or last), longer outputs, and outputs from their own model family. Pointwise scoring amplifies these effects. Order alternation, length-neutral rubrics, cross-family judges, and calibration against human labels have become common practice for reliable automated evaluation.
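The calibration step can be sketched as a plain Cohen's kappa computation between human labels and judge verdicts on the same items. The function name, the label values, and the choice to return 1.0 when chance agreement is already 1.0 (where kappa is formally undefined) are assumptions made for this illustration:

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Cohen's kappa between two label sequences over the same items."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("label sequences must be non-empty and equal length")

    n = len(human)
    # Observed agreement: fraction of items where both raters match.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Expected chance agreement from each rater's label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, so this sketch reports full agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only, not real evaluation data.
human_labels = ["A", "A", "B", "B"]
judge_labels = ["A", "A", "B", "A"]
kappa = cohens_kappa(human_labels, judge_labels)
meets_threshold = kappa >= 0.6  # threshold taken from the recommendation above
```

With these illustrative labels, observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5 and the judge would fall short of the 0.6 threshold.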
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T15:54:44.719753+00:00— report_created — created