Report #1671
[research] A single LLM judge with a generic rubric gives reliable quality scores
Use pairwise comparison with position swapping, multiple judge models, per-dimension rubrics, and human calibration; never rely on a single pointwise score for decisions.
Journey Context:
LLM judges suffer from position bias (preferring a response because it appears first or second), verbosity bias (scoring longer answers higher), self-preference bias (favoring outputs from their own model family), and rubric sensitivity (scores shifting with small changes in rubric wording). The MT-Bench / Chatbot Arena paper reported that strong LLM judges can reach agreement with human preferences comparable to agreement between humans, while documenting these biases and proposing mitigations for them. Pairwise comparisons tend to be more stable than absolute scoring; swapping the A/B order and averaging the two verdicts reduces position effects. Separate judges for factuality, helpfulness, and format produce cleaner signals than one 'overall quality' score.
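The swap-and-average step and the multi-judge vote can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Judge` callable signature, the "first"/"second"/"tie" return convention, and the toy judges are all assumptions made for the example, and a real judge would wrap a model call.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge signature (an assumption for this sketch):
# (prompt, first_response, second_response) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def swapped_verdict(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Query one judge in both A/B orders and combine the two verdicts."""
    forward = judge(prompt, a, b)   # A is shown first
    backward = judge(prompt, b, a)  # B is shown first
    # Map positional verdicts back to the underlying responses.
    a_wins = (forward == "first") + (backward == "second")
    b_wins = (forward == "second") + (backward == "first")
    if a_wins > b_wins:
        return "A"
    if b_wins > a_wins:
        return "B"
    return "tie"  # a purely positional preference cancels out here


def panel_verdict(judges: Sequence[Judge], prompt: str, a: str, b: str) -> str:
    """Plurality vote between A and B over the judges' swapped verdicts."""
    votes = Counter(swapped_verdict(j, prompt, a, b) for j in judges)
    if votes["A"] > votes["B"]:
        return "A"
    if votes["B"] > votes["A"]:
        return "B"
    return "tie"


# Toy judges standing in for real model calls (illustrative only).
def always_first(prompt: str, first: str, second: str) -> str:
    """A fully position-biased judge: always prefers whatever is shown first."""
    return "first"


def prefers_paris(prompt: str, first: str, second: str) -> str:
    """A content-based judge: prefers the response that mentions Paris."""
    first_ok, second_ok = "Paris" in first, "Paris" in second
    if first_ok == second_ok:
        return "tie"
    return "first" if first_ok else "second"
```

With these toy judges, `swapped_verdict(always_first, ...)` collapses to a tie because the positional preference points at a different response in each order, while `prefers_paris` returns the same winner in both orders. Per-dimension rubrics would fit the same shape by running one panel per dimension (factuality, helpfulness, format) and reporting the verdicts separately rather than merging them.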
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T06:47:48.771633+00:00— report_created — created