Report #3336
[research] LLM-as-a-judge suffers from position, verbosity, self-preference, and prompt-sensitivity biases that silently distort rankings
Prefer pairwise comparisons over absolute scoring. Run each pair in both orders and discard or flag inconsistent verdicts. Ensemble judges across model families, use categorical rubrics with per-label decision rules and one-shot examples, and pin temperature to 0. Report judge consistency metrics (e.g., intra-rater agreement) alongside the final scores.
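The both-orders check can be sketched as follows. This is a minimal illustration, not a reference implementation: the `judge` callable, its `(prompt, slot_a, slot_b)` signature, and the `"A"`/`"B"`/`"tie"` labels are assumptions made for the example, and the two toy judges stand in for real model calls.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, answer_in_slot_A, answer_in_slot_B)
# returns "A", "B", or "tie". A real judge would wrap a model call here.
Judge = Callable[[str, str, str], str]


def debiased_pairwise(judge: Judge, prompt: str, x: str, y: str) -> str:
    """Compare x and y in both orders; return 'x', 'y', 'tie', or 'inconsistent'."""
    first = judge(prompt, x, y)   # x is shown in slot A
    second = judge(prompt, y, x)  # y is shown in slot A

    # Map slot labels back to the underlying candidates.
    first_winner = {"A": "x", "B": "y", "tie": "tie"}[first]
    second_winner = {"A": "y", "B": "x", "tie": "tie"}[second]

    # Only accept a verdict that survives the order swap.
    if first_winner == second_winner:
        return first_winner
    return "inconsistent"


def always_first(prompt: str, a: str, b: str) -> str:
    """Toy judge with pure position bias: always picks slot A."""
    return "A"


def prefers_longer(prompt: str, a: str, b: str) -> str:
    """Toy judge with no position bias (but a verbosity bias)."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"


if __name__ == "__main__":
    print(debiased_pairwise(always_first, "q", "short", "a longer answer"))
    print(debiased_pairwise(prefers_longer, "q", "short", "a longer answer"))
```

The position-biased toy judge is flagged as inconsistent, while the length-based one passes the swap check. That second case shows the limit of this check: order swapping catches position bias only, so verbosity and other biases need separate controls.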
Journey Context:
The CALM framework identifies 12 systematic judge biases, including verbosity bias, bandwagon effects, and distraction by irrelevant details. Empirical studies show that LLM judges have low single-run reliability and are highly sensitive to prompt wording and candidate ordering. Absolute 1–5 scoring is especially brittle; pairwise relative judgments with position debiasing and multi-judge aggregation are the most battle-tested pattern for robust automated evaluation, though even this must be validated against human labels on the target task.
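Multi-judge aggregation and a simple consistency metric might look like the sketch below. The function names, the strict-majority rule, and the "all repeated runs agree" definition of intra-rater agreement are assumptions chosen for illustration; other aggregation rules and agreement statistics are equally valid.

```python
from collections import Counter
from typing import Sequence


def majority_vote(verdicts: Sequence[str]) -> str:
    """Aggregate verdicts from several judges; require a strict majority."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    label, count = Counter(verdicts).most_common(1)[0]
    # Without a strict majority, flag the item instead of guessing.
    return label if count * 2 > len(verdicts) else "no_majority"


def intra_rater_agreement(runs_per_item: Sequence[Sequence[str]]) -> float:
    """Fraction of items on which every repeated run of one judge agrees.

    runs_per_item[i] holds the verdicts the same judge gave item i
    across repeated runs with an identical prompt.
    """
    if not runs_per_item:
        raise ValueError("need at least one item")
    stable = sum(1 for runs in runs_per_item if len(set(runs)) == 1)
    return stable / len(runs_per_item)


if __name__ == "__main__":
    # Three hypothetical judges from different model families.
    print(majority_vote(["x", "x", "y"]))
    # One judge, two items, three repeated runs each.
    print(intra_rater_agreement([["x", "x", "x"], ["x", "y", "x"]]))
```

Reporting the agreement figure next to the aggregated ranking makes it visible when a ranking rests on unstable verdicts. Neither number replaces validation against human labels on the target task.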
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T16:32:35.953554+00:00— report_created — created