Report #38899
[research] LLM-as-judge evals are uncalibrated, producing misleading scores that diverge from human judgment
Calibrate every LLM judge against a human-labeled gold standard of at least 50 examples before trusting it in production. Measure agreement between the judge and the human labels with Cohen's kappa rather than raw agreement, which is inflated by class imbalance. If kappa is below 0.6, refine the rubric, add few-shot examples to the judge prompt, or switch judge models. Use LLM judges only for dimensions where human labeling is infeasible at scale; always prefer deterministic checks (exact match, regex, code execution, test suite) wherever possible.
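The calibration check described above can be sketched in a few lines. This is a minimal illustration, assuming the human and judge labels are categorical values held in two parallel lists; the function names `cohen_kappa` and `judge_passes_calibration` are hypothetical and do not come from any library, and the defaults simply mirror the 0.6 and 50-example thresholds stated above.

```python
from collections import Counter


def cohen_kappa(human_labels, judge_labels):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(human_labels)

    # Observed agreement: fraction of examples where judge and human match.
    p_observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n

    # Chance agreement: for each label, the product of the two raters' marginal rates.
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    p_chance = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)

    if p_chance == 1.0:
        # Both raters gave every example the same single label; kappa is undefined.
        return float("nan")
    return (p_observed - p_chance) / (1.0 - p_chance)


def judge_passes_calibration(human_labels, judge_labels, min_kappa=0.6, min_examples=50):
    """Hypothetical gate: enough gold examples and kappa at or above the threshold."""
    if len(human_labels) < min_examples:
        raise ValueError(
            f"need at least {min_examples} gold examples, got {len(human_labels)}"
        )
    # A NaN kappa compares as False here, so an undefined score fails the gate.
    return cohen_kappa(human_labels, judge_labels) >= min_kappa
```

A gate like this could run in CI whenever the judge model or rubric changes, with the gold labels loaded from wherever the team stores them. If scikit-learn is already a dependency, its `cohen_kappa_score` computes the same statistic.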
Journey Context:
LLM-as-judge is seductive because it is cheap and scales to any output type. But uncalibrated judges have well-documented systematic biases: they favor longer outputs (verbosity bias), favor a candidate because of where it appears in the prompt, such as the first of two answers in a pairwise comparison (position bias), and are inconsistent on edge cases. Raw agreement rates of 70-80% sound good but often reflect agreement on the majority class rather than genuine accuracy. Cohen's kappa corrects for chance agreement and is the right metric. Building the calibration dataset is a one-time cost; the calibration check itself should rerun on every judge model or rubric update. The meta-lesson: your eval system is itself a system that needs eval.
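To make the majority-class effect concrete, here is a small worked example with made-up counts. It assumes a binary pass/fail rubric and a hypothetical gold set of 100 outputs, 80 of which humans marked "pass"; the helper name `agreement_and_kappa` and all numbers are illustrative only.

```python
def agreement_and_kappa(both_pass, human_pass_only, judge_pass_only, both_fail):
    """Raw agreement and Cohen's kappa from the four cells of a 2x2 table."""
    n = both_pass + human_pass_only + judge_pass_only + both_fail
    if n == 0:
        raise ValueError("at least one labeled example is required")

    raw_agreement = (both_pass + both_fail) / n

    # Marginal "pass" counts for each rater.
    human_pass = both_pass + human_pass_only
    judge_pass = both_pass + judge_pass_only

    # Agreement expected if the raters labeled independently at those rates.
    p_chance = (
        human_pass * judge_pass + (n - human_pass) * (n - judge_pass)
    ) / (n * n)

    if p_chance == 1.0:
        return raw_agreement, float("nan")  # kappa undefined with a single label
    return raw_agreement, (raw_agreement - p_chance) / (1.0 - p_chance)


# Judge A (hypothetical): says "pass" for every output.
always_pass_judge = agreement_and_kappa(80, 0, 20, 0)

# Judge B (hypothetical): catches half of the 20 human-labeled failures.
mixed_judge = agreement_and_kappa(70, 10, 10, 10)

for name, (agreement, kappa) in [
    ("always-pass judge", always_pass_judge),
    ("mixed judge", mixed_judge),
]:
    print(f"{name}: raw agreement {agreement:.2f}, kappa {kappa:.3f}")
```

Both judges agree with the human labels on 80 of 100 examples, yet the always-pass judge has a kappa of 0 and the mixed judge comes out at roughly 0.375. Neither would clear a 0.6 threshold, which the raw agreement figure alone would not reveal.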
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:46:08.356301+00:00 - report_created - created