Report #3569
[research] LLM-as-a-judge evaluations are noisy because of position, verbosity, and self-preference biases
Use pairwise comparison with position-swapped replicates, a reference answer/rubric, and aggregate over multiple judge prompts; report inter-judge agreement and use a stronger judge than the model being evaluated.
Journey Context:
Open-ended chat evaluation is expensive with human raters, so using a strong LLM as a judge is attractive. GPT-4 as a judge agrees with human preferences about 80% of the time, but it exhibits position bias (favoring an answer because it appears first or second), verbosity bias (favoring longer outputs), and self-enhancement bias (favoring its own outputs). Mitigations include scoring against a gold reference, running both A-vs-B and B-vs-A and counting a win only when the same answer wins in both orders, and forcing rubric-based point allocation before the final verdict. MT-bench applies these patterns with an LLM judge; Chatbot Arena uses the same pairwise format but collects human votes. Do not trust single-shot LLM ratings for safety-critical claims.
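The position-swap and multi-judge aggregation steps could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `Judge` callable signature, the verdict strings `"first"`/`"second"`/`"tie"`, and the function names `swapped_verdict` and `aggregate_judges` are all assumptions made for this example. A real judge would wrap an LLM call with a rubric or reference answer in its prompt.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface (an assumption for this sketch):
# takes (question, first_answer, second_answer) and returns
# "first", "second", or "tie" for the answer it prefers.
Judge = Callable[[str, str, str], str]


def swapped_verdict(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Query the judge in both orders; return "A", "B", or "tie".

    A win is counted only if the same answer wins in both positions,
    so a purely position-driven preference collapses to "tie".
    """
    forward = judge(question, answer_a, answer_b)   # A shown first
    backward = judge(question, answer_b, answer_a)  # B shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"


def aggregate_judges(
    judges: Sequence[Judge], question: str, answer_a: str, answer_b: str
) -> Tuple[str, float]:
    """Combine several judges (e.g. different judge prompts).

    Returns (winner, agreement), where agreement is the fraction of
    judges that returned the most common verdict. If the top two
    verdicts are equally common, the winner is reported as "tie".
    """
    if not judges:
        raise ValueError("at least one judge is required")
    votes = Counter(swapped_verdict(j, question, answer_a, answer_b) for j in judges)
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        winner = "tie"
    else:
        winner = ranked[0][0]
    agreement = ranked[0][1] / len(judges)
    return winner, agreement
```

Reporting the agreement fraction alongside the winner makes it visible when judges disagree, which a single-shot rating hides. Note that swapping positions addresses position bias only; verbosity and self-enhancement bias still need a reference answer, a rubric, or a judge that is different from the model under evaluation.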
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T17:34:17.733379+00:00— report_created — created