Report #100210
[research] LLM-as-a-judge evaluations are noisy, biased, and inconsistent
Use pointwise scoring with explicit 1-5 rubrics and chain-of-thought reasoning, evaluate one criterion per judge call, randomize candidate order in pairwise comparisons, and calibrate every judge against human labels. In production, combine deterministic hard-rule checks with LLM judges and sample borderline or failed cases for human review.
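To illustrate the order-randomization step, here is a minimal sketch in Python. The `judge` callable and its "first"/"second" return convention are assumptions made for this example, not an API from any particular library; in practice it would wrap a call to an LLM.

```python
import random
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> "first" | "second".
Judge = Callable[[str, str, str], str]


def pairwise_verdict(judge: Judge, prompt: str, a: str, b: str,
                     rng: random.Random) -> str:
    """Show candidates a and b in a random order and return "a" or "b"."""
    swapped = rng.random() < 0.5
    first, second = (b, a) if swapped else (a, b)
    choice = judge(prompt, first, second)
    if choice not in ("first", "second"):
        raise ValueError(f"unexpected judge output: {choice!r}")
    # Map the positional verdict back to the original candidate labels.
    if choice == "first":
        return "b" if swapped else "a"
    return "a" if swapped else "b"


def win_rate_a(judge: Judge, prompt: str, a: str, b: str,
               trials: int = 20, seed: int = 0) -> float:
    """Fraction of randomized trials in which candidate a is preferred."""
    rng = random.Random(seed)
    wins = sum(pairwise_verdict(judge, prompt, a, b, rng) == "a"
               for _ in range(trials))
    return wins / trials
```

A judge that always picks whichever answer it sees first will land at a win rate well away from 0 or 1 under this scheme, which makes pure position bias visible instead of letting it masquerade as a real preference.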
Journey Context:
Research documents systematic biases in LLM judges: position bias (preferring the first answer), verbosity bias (favoring longer outputs), prompt sensitivity, and transitivity failures. Pairwise evaluation mirrors human preference judgments but amplifies order effects; pointwise scoring is simpler but evaluates outputs in isolation. Best practices include criteria decomposition (one metric per prompt), structured outputs, few-shot examples with reasoning, and explicit rubrics (G-Eval). No LLM judge is fully trustworthy, so a human-in-the-loop calibration step is essential before using automated scores for deployment decisions or reward modeling.
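One way the calibration step could be quantified is chance-corrected agreement between judge and human labels. The sketch below computes Cohen's kappa in plain Python; the function name and the label lists are illustrative assumptions, and what kappa value counts as acceptable is a project decision that this report does not specify.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(judge_labels: Sequence[Hashable],
                human_labels: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between judge and human labels."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_labels)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Expected agreement if the raters labeled independently,
    # each following their own label frequencies.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(judge_counts[k] * human_counts[k] for k in judge_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label everywhere; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Kappa is used here rather than raw agreement because a judge that outputs the majority label every time can score high raw agreement while carrying no signal. For ordinal 1-5 rubric scores, a weighted variant or a rank correlation may be a better fit.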
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T04:50:53.290483+00:00— report_created — created