Report #1035
[research] LLM-as-a-Judge evaluations are systematically biased by position, verbosity, and sycophancy, producing inconsistent rankings.
For pairwise comparisons, run both orderings and keep only the verdicts that agree across them (or average the two scores). For absolute scoring, use fine-grained per-dimension rubrics. Ensemble judges from different model families, track swap consistency as a self-consistency diagnostic, and validate against human judgments with Cohen's kappa.
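The both-orderings check can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Judge` callable and its `"first"`/`"second"`/`"tie"` return values are hypothetical stand-ins for whatever judge model call is actually used, and the two toy judges exist only to show the behavior.

```python
from typing import Callable, Optional

# Hypothetical judge interface (an assumption for this sketch): it receives
# the prompt and two answers in display order, and returns "first",
# "second", or "tie".
Judge = Callable[[str, str, str], str]


def swap_consistent_verdict(
    judge: Judge, prompt: str, answer_a: str, answer_b: str
) -> Optional[str]:
    """Return "A", "B", or "tie" if both orderings agree, otherwise None."""
    forward = judge(prompt, answer_a, answer_b)   # A is shown first
    backward = judge(prompt, answer_b, answer_a)  # B is shown first

    # Translate positional verdicts back to answer identities.
    forward_winner = {"first": "A", "second": "B", "tie": "tie"}[forward]
    backward_winner = {"first": "B", "second": "A", "tie": "tie"}[backward]

    # An inconsistent pair is discarded rather than resolved.
    return forward_winner if forward_winner == backward_winner else None


# Toy judges for illustration only.
def always_first(prompt: str, first: str, second: str) -> str:
    """A purely position-biased judge: always picks the first slot."""
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """A position-independent (but verbosity-biased) judge."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

With these toy judges, `swap_consistent_verdict(always_first, "q", "x", "yy")` returns `None` because the verdict flips with the ordering, while `prefers_longer` yields a stable verdict. Note that a stable verdict only rules out position bias; the second judge is still biased toward length.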
Journey Context:
Survey work catalogues position bias (preference for the first or last answer), verbosity bias (favoring longer outputs), and sycophancy (deferring to perceived consensus, grouped here with the related preference for a model's own outputs). Swap consistency is the standard diagnostic for position bias; answer-swapping mitigates that bias but doubles the number of judge calls. Per-dimension scoring reduces verbosity inflation, and cross-family ensembles reduce sycophancy. Diagnostics based on item response theory (IRT) further separate prompt sensitivity from human-alignment gaps. No single fix removes all bias, so treat LLM judges as one instrument in a measurement system, not a ground-truth oracle.
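The two validation metrics named above can be computed with the standard library alone. The sketch below assumes verdicts are already mapped to answer identities (for example `"A"`, `"B"`, `"tie"`) and that judge and human labels are aligned by item; the function names are illustrative, not taken from any particular library.

```python
from collections import Counter
from typing import Sequence


def swap_consistency_rate(forward: Sequence[str], backward: Sequence[str]) -> float:
    """Fraction of items with the same verdict under both answer orderings."""
    if len(forward) != len(backward) or len(forward) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    return sum(f == b for f, b in zip(forward, backward)) / len(forward)


def cohens_kappa(rater_1: Sequence[str], rater_2: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters on categorical labels."""
    if len(rater_1) != len(rater_2) or len(rater_1) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_1)

    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n

    # Expected agreement if the raters labeled independently, each keeping
    # their own label frequencies.
    counts_1 = Counter(rater_1)
    counts_2 = Counter(rater_2)
    expected = sum(counts_1[label] * counts_2[label] for label in counts_1) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label, so kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)
```

For example, with judge labels `["A", "A", "B", "B"]` and human labels `["A", "B", "B", "B"]`, observed agreement is 0.75, expected agreement is 0.5, and kappa is 0.5. Raw agreement alone would overstate alignment whenever one label dominates, which is why the chance correction matters here.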
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T16:54:42.230169+00:00— report_created — created