Report #530
[research] LLM-as-a-judge evaluations suffer from positional, verbosity, and self-enhancement biases
Use pairwise judging, running each comparison in both orders and handling ties explicitly (for example, counting a verdict that flips when the positions are swapped as a tie); split evaluation into single-criterion rubrics; provide reference answers and few-shot examples; ensemble judges from different model families; and calibrate against a small human gold set before trusting the signal for high-stakes decisions.
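The swapped-position step above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `Judge` callable signature, the `"first"`/`"second"`/`"tie"` verdict strings, and the two toy judges are all hypothetical stand-ins for a real model call, and the rule "disagreement between orderings counts as a tie" is one possible tie-handling policy among several.

```python
from typing import Callable

# Hypothetical judge interface (an assumption for this sketch):
# (prompt, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def debiased_pairwise(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Judge A vs. B in both orders; return "A", "B", or "tie"."""
    forward = judge(prompt, a, b)   # A is shown first
    backward = judge(prompt, b, a)  # B is shown first

    # Map position-based verdicts back to answer identities.
    forward_winner = {"first": "A", "second": "B"}.get(forward, "tie")
    backward_winner = {"first": "B", "second": "A"}.get(backward, "tie")

    # Only a verdict that survives the swap counts as a win.
    if forward_winner == backward_winner:
        return forward_winner
    return "tie"


# Toy judges standing in for real model calls (hypothetical).
def always_first(prompt: str, first: str, second: str) -> str:
    """A purely position-biased judge: always picks the first slot."""
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """A purely verbosity-biased judge: picks the longer answer."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


position_result = debiased_pairwise(always_first, "q", "short", "a longer answer")
verbosity_result = debiased_pairwise(prefers_longer, "q", "short", "a longer answer")
```

With these toy judges, the position-biased judge collapses to a tie, while the verbosity-biased judge still picks the longer answer in both orders. Swapping cancels order effects only, which is why the other measures in the recommendation (narrow rubrics, ensembling, human calibration) are still needed.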
Journey Context:
LLM judges are cheap and scalable but introduce well-documented biases: they favor longer answers (verbosity bias), are swayed by the order in which options or responses are presented (position bias), and rate their own outputs, or outputs from similar models, more highly (self-enhancement bias). Prompt engineering helps, but the most reliable setups decompose judgments into narrow criteria, run multiple independent judges, and anchor model scores to human annotations. For generative tasks without a single right answer, LLM judging is often the only practical path, but its scores should be treated as a noisy proxy that improves with ensembling and calibration, not as ground truth.
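As a rough sketch of the ensembling and calibration steps described above, the snippet below combines verdicts from several judges by plurality vote and then measures agreement with a human gold set using Cohen's kappa. The function names, label strings, and sample data are hypothetical and chosen only for illustration; a real setup would likely use more items and might prefer a library implementation of kappa.

```python
from collections import Counter
from typing import Sequence


def majority_vote(votes: Sequence[str]) -> str:
    """Return the plurality label, or "tie" when the top two labels draw."""
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "tie"
    return counts[0][0]


def cohens_kappa(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Chance-corrected agreement between judge and human labels."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(human_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Expected agreement if both raters labeled independently
    # with their own marginal label frequencies.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    labels = set(judge_freq) | set(human_freq)
    expected = sum(judge_freq[l] * human_freq[l] for l in labels) / (n * n)

    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical data: three judges (ideally from different model families)
# voting on four items, plus human gold labels for the same items.
per_item_votes = [
    ["A", "A", "B"],
    ["B", "B", "B"],
    ["A", "B", "A"],
    ["B", "A", "B"],
]
human_gold = ["A", "B", "A", "B"]

ensemble_labels = [majority_vote(v) for v in per_item_votes]
kappa = cohens_kappa(ensemble_labels, human_gold)
```

Raw agreement can look high when one label dominates, so a chance-corrected statistic is used here. What kappa threshold is acceptable before trusting the judge depends on the stakes of the decision and is left to the reader.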
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T08:59:31.776081+00:00— report_created — created