Report #98870
[research] LLM-as-a-judge scores are noisy and contradict human labels
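Build a judge calibration dataset of 50-100 human-labeled examples; split it into few-shot anchors, a dev set, and a held-out test set; measure the judge's true-positive rate (TPR) and true-negative rate (TNR) against the human labels on the test set; pin the judge model snapshot so scores stay comparable across runs.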
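The split and the TPR/TNR measurement can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: the `Example` type, the function names, the anchor count, the dev fraction, and the binary pass/fail labeling are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    text: str
    human_label: bool  # True = human marked the output as acceptable (assumed binary rubric)


def split_examples(examples, n_anchors=6, dev_fraction=0.4, seed=0):
    """Shuffle once with a fixed seed, then cut into anchors, dev, and held-out test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    anchors = shuffled[:n_anchors]          # go into the judge prompt as few-shot examples
    rest = shuffled[n_anchors:]
    n_dev = int(len(rest) * dev_fraction)   # dev set is for iterating on the rubric
    return anchors, rest[:n_dev], rest[n_dev:]


def tpr_tnr(human_labels, judge_labels):
    """Return (TPR, TNR) of the judge, treating human labels as ground truth.

    A rate is None when the test set has no examples of that class.
    """
    pairs = list(zip(human_labels, judge_labels))
    positives = sum(1 for h, _ in pairs if h)
    negatives = len(pairs) - positives
    true_pos = sum(1 for h, j in pairs if h and j)
    true_neg = sum(1 for h, j in pairs if not h and not j)
    tpr = true_pos / positives if positives else None
    tnr = true_neg / negatives if negatives else None
    return tpr, tnr
```

Only the held-out test split should feed `tpr_tnr`; reusing anchors or dev examples there would overstate how well the judge agrees with humans.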
Journey Context:
LLM judges have known position and verbosity biases, but judging is easier than generating, so agreement with human labels above 80% is achievable. The common failure mode is pairing a frontier model with a vague rubric and no calibration. The rubric text does the heavy lifting; few-shot examples anchor the scale. Re-calibrate every 1-2 months with fresh production samples. For high-stakes decisions, never let an LLM judge be the only scorer; pair it with deterministic checks or human review.
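One way to pair the judge with deterministic checks on high-stakes decisions is sketched below. The function name `gated_verdict`, the three verdict strings, and the policy of escalating judge rejections to a human are hypothetical choices for illustration; the report does not specify a particular gating policy.

```python
from typing import Callable, Iterable


def gated_verdict(
    output: str,
    judge_pass: bool,
    checks: Iterable[Callable[[str], bool]],
) -> str:
    """Combine an LLM judge verdict with deterministic checks.

    Assumed policy: any failed deterministic check rejects outright, the judge
    can only approve outputs that already passed every check, and a judge
    rejection of an otherwise valid output goes to a human.
    """
    if not all(check(output) for check in checks):
        return "fail"
    return "pass" if judge_pass else "needs_human_review"


# Hypothetical deterministic checks: non-empty output, no leftover placeholder text.
example_checks = [
    lambda text: len(text.strip()) > 0,
    lambda text: "TODO" not in text,
]
```

For instance, `gated_verdict("Refund issued.", judge_pass=True, checks=example_checks)` approves the output, while the same call with `judge_pass=False` routes it to human review instead of rejecting it on the judge's word alone.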
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T04:55:15.590279+00:00— report_created — created