Report #98331
[research] LLM-as-a-judge evaluations are skewed by position, verbosity, self-preference, and style biases
Judge each pair in both orderings and average the two verdicts. Control verbosity bias by adding explicit length or conciseness criteria, or by length-normalizing scores. Use a judge from a different model family than the candidates. Before trusting any ranking, validate the judge against human labels with a chance-corrected metric such as Cohen's kappa.
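The swapped-order step can be sketched as follows. This is a minimal illustration, not a reference implementation: the `judge` callable, its signature, and its "first" / "second" / "tie" return values are assumptions made for the example, and the two toy judges stand in for a real model call.

```python
from typing import Callable

# Hypothetical judge interface (an assumption for this sketch):
# takes (prompt, first_response, second_response) and returns
# "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def swapped_pairwise_score(judge: Judge, prompt: str, a: str, b: str) -> float:
    """Return candidate A's score in [0, 1], averaged over both orderings."""

    def score_for_a(verdict: str, a_is_first: bool) -> float:
        if verdict == "tie":
            return 0.5
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
        a_won = (verdict == "first") == a_is_first
        return 1.0 if a_won else 0.0

    forward = score_for_a(judge(prompt, a, b), a_is_first=True)
    backward = score_for_a(judge(prompt, b, a), a_is_first=False)
    return (forward + backward) / 2


# Toy judges that simulate a single bias each (illustrative only).
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


position_biased = swapped_pairwise_score(always_first, "q", "x", "y")
length_biased = swapped_pairwise_score(prefers_longer, "q", "long answer", "short")
```

With the purely position-biased judge the averaged score collapses to 0.5, so the bias cancels. With the length-biased judge the longer response still scores 1.0 in both orders, which is why the length criteria and cross-family judge are separate steps rather than something swapping covers.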
Journey Context:
LLM judges are scalable and can reach roughly 80% agreement with human raters, but they inherit systematic biases: favoring a response because of its position (first or second), favoring longer responses, favoring outputs from their own model family, and favoring a confident tone. MT-Bench and follow-up work showed that swap augmentation and cross-family judges reduce these biases. Recent large-scale audits confirm that high test-retest reliability can coexist with severe position bias: a judge can return the same verdict on every rerun and still be choosing by position. Raw exact-match agreement therefore overstates discriminative power. Calibrate against human judgments, and never use a single LLM judge as the sole arbiter for high-stakes decisions.
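To see why raw agreement can mislead, here is a small sketch of Cohen's kappa computed by hand. The labels are made-up illustration data, not results from any study, and the function name `cohens_kappa` is chosen for this example.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)

    # Observed agreement: fraction of items where the raters match.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n

    # Expected agreement if each rater labeled independently
    # according to their own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


# Made-up data: humans prefer response "A" on 8 of 10 items,
# while a fully position-biased judge picks "A" every time.
human = ["A"] * 8 + ["B"] * 2
judge = ["A"] * 10

raw_agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohens_kappa(human, judge)
```

In this toy case raw agreement is 0.8 while kappa is 0, because a judge that always answers "A" agrees with the humans no more often than chance would predict given the label frequencies.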
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T04:47:09.498787+00:00— report_created — created