Report #411
[research] LLM-as-a-judge scores are noisy, biased, and inconsistent
Use one judge call per criterion with explicit integer categorical rubrics, require chain-of-thought reasoning before scoring, enforce structured outputs (JSON), randomize pairwise answer order and average both permutations, allow an 'Unknown' verdict, and calibrate against human labels until per-criterion agreement exceeds 75%.
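The order-swapping step can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Judge` callable, its three verdict strings, and the function name `debiased_preference` are all hypothetical, and running both orderings is used here in place of sampling a random order.

```python
from typing import Callable, Optional

# Hypothetical judge signature (an assumption, not a real library API):
# takes (question, first_answer, second_answer) and returns
# "first", "second", or "unknown".
Judge = Callable[[str, str, str], str]

_WIN_FOR_FIRST = {"first": 1.0, "second": 0.0}


def debiased_preference(judge: Judge, question: str, a: str, b: str) -> Optional[float]:
    """Score answer `a` against `b`, averaged over both presentation orders.

    Returns 1.0 if `a` wins in both orders, 0.0 if `b` wins in both,
    0.5 if the verdict flips with position, and None if either call
    returns an 'unknown' (or otherwise unrecognized) verdict.
    """
    verdict_ab = judge(question, a, b)  # a shown first
    verdict_ba = judge(question, b, a)  # b shown first
    if verdict_ab not in _WIN_FOR_FIRST or verdict_ba not in _WIN_FOR_FIRST:
        return None
    score_ab = _WIN_FOR_FIRST[verdict_ab]        # 1.0 means a won
    score_ba = 1.0 - _WIN_FOR_FIRST[verdict_ba]  # "first" here means b won
    return (score_ab + score_ba) / 2
```

A score of 0.5 signals that the verdict depended on position alone, so such pairs are better treated as ties or sent for human review than counted as a win for either answer.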
Journey Context:
The LLM-as-a-judge literature documents systematic biases: position bias (preferring the first answer), verbosity bias (rewarding longer outputs), and self-enhancement bias (favoring a model's own outputs). 'A Survey on LLM-as-a-Judge' and follow-up work point to decomposition as the remedy: separate judge calls per dimension, step-by-step reasoning, and G-Eval-style rubrics. HealthBench demonstrated that instance-specific rubrics with 10-40 weighted criteria can reach physician-level inter-annotator agreement. The most common failure mode is overloading one judge prompt with multiple vague dimensions, which produces inconsistent scores and hides what is actually wrong.
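A minimal sketch of the decomposed setup, assuming a caller-supplied `call_llm(prompt) -> str` function. The rubric contents, prompt wording, and all names below are hypothetical, chosen only to illustrate one judge call per criterion, JSON output parsing, and a per-criterion agreement check against human labels.

```python
import json
from typing import Callable, Dict, List, Optional

# Hypothetical rubric: criterion name -> explicit integer categories.
RUBRIC = {
    "factual_accuracy": "0 = contains errors, 1 = minor imprecision, 2 = fully accurate",
    "completeness": "0 = misses the question, 1 = partial, 2 = complete",
}


def build_prompt(criterion: str, scale: str, question: str, answer: str) -> str:
    """Build a prompt that covers exactly one criterion."""
    return (
        f"Evaluate ONLY the criterion '{criterion}'.\n"
        f"Scale: {scale}\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reason step by step, then reply with JSON: "
        '{"reasoning": "<text>", "score": <integer or "unknown">}'
    )


def judge_all(call_llm: Callable[[str], str], question: str, answer: str,
              rubric: Dict[str, str] = RUBRIC) -> Dict[str, Optional[int]]:
    """Make one judge call per criterion; None stands for 'Unknown'."""
    scores: Dict[str, Optional[int]] = {}
    for criterion, scale in rubric.items():
        raw = call_llm(build_prompt(criterion, scale, question, answer))
        try:
            score = json.loads(raw).get("score")
        except (json.JSONDecodeError, AttributeError):
            score = None  # malformed output is treated as 'Unknown'
        is_int = isinstance(score, int) and not isinstance(score, bool)
        scores[criterion] = score if is_int else None
    return scores


def agreement(judge_scores: List[dict], human_scores: List[dict], criterion: str) -> float:
    """Exact-match rate on one criterion, skipping items the judge left unknown."""
    pairs = [(j[criterion], h[criterion])
             for j, h in zip(judge_scores, human_scores)
             if j.get(criterion) is not None]
    if not pairs:
        return 0.0  # assumption: no scored items counts as no agreement
    return sum(j == h for j, h in pairs) / len(pairs)
```

Keeping each criterion in its own call means a low agreement figure points at one specific rubric line to rewrite, rather than at the judge prompt as a whole.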
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T07:53:18.733283+00:00— report_created — created