Report #99796
[research] LLM-as-judge scores look stable but do not correlate with human quality
Calibrate every judge against a human-annotated golden set using Cohen's kappa (target ≥ 0.7), pin the judge model, use pointwise scoring with explicit rubric anchors for CI thresholds, and re-calibrate when the rubric or model changes.
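A minimal sketch of the calibration check, assuming both the human annotators and the judge emit categorical labels per item. The function name `cohens_kappa`, the `pass`/`fail` labels, and the 100-item golden set are hypothetical and only for illustration; the 0.7 threshold is the target stated above.

```python
from collections import Counter

KAPPA_TARGET = 0.7  # target from the summary above


def cohens_kappa(human, judge):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Observed agreement: fraction of items where both raters give the same label.
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: sum over labels of the product of each rater's marginals.
    h_counts, j_counts = Counter(human), Counter(judge)
    p_e = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label everywhere; kappa is undefined,
        # so this sketch reports perfect agreement.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical imbalanced golden set: humans mark 90 items "pass", 10 "fail".
human = ["pass"] * 90 + ["fail"] * 10
# Hypothetical judge that catches only 2 of the 10 failures.
judge = ["pass"] * 98 + ["fail"] * 2

raw_agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohens_kappa(human, judge)

print(f"raw agreement: {raw_agreement:.2f}")  # 0.92
print(f"kappa: {kappa:.2f}")                  # 0.31
print("judge calibrated:", kappa >= KAPPA_TARGET)  # False
```

In this made-up example the judge agrees with humans on 92% of items yet reaches a kappa of only about 0.31, so it would fail the gate despite the high raw agreement.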
Journey Context:
Generic judge metrics such as RAGAS faithfulness can score high while domain experts flag a third of outputs as materially incomplete. Raw percentage agreement is misleading on imbalanced data, because a judge that always predicts the majority label already agrees with humans most of the time; Cohen's kappa corrects for this chance agreement. Research also shows that pairwise judge preferences flip roughly 35% of the time when the response order is swapped, which makes pointwise rubrics more reproducible for regression gating. The biggest calibration gains come from rubric quality and anchor examples, not from using a larger judge model.
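The order-swap effect mentioned above can be measured directly before trusting a pairwise judge. The following is a rough sketch, assuming the judge is wrapped in a callable that takes two responses and returns `"first"` or `"second"`. The name `order_flip_rate` and both stub judges are hypothetical stand-ins, not part of any real library.

```python
def order_flip_rate(pairs, judge):
    """Fraction of pairs where the judge's winner changes when order is swapped.

    `judge(first, second)` must return "first" or "second".
    """
    if not pairs:
        raise ValueError("need at least one response pair")
    flips = 0
    for a, b in pairs:
        forward = judge(a, b)   # a is shown first
        backward = judge(b, a)  # b is shown first
        # Map each verdict back to the underlying response: 0 means a, 1 means b.
        winner_forward = 0 if forward == "first" else 1
        winner_backward = 1 if backward == "first" else 0
        if winner_forward != winner_backward:
            flips += 1
    return flips / len(pairs)


# Hypothetical stub judges, used here only to exercise the function.
def position_biased_judge(first, second):
    return "first"  # always prefers whatever is shown first


def length_judge(first, second):
    return "first" if len(first) >= len(second) else "second"


pairs = [
    ("short", "a longer answer"),
    ("detailed reply here", "ok"),
]

print(order_flip_rate(pairs, position_biased_judge))  # 1.0
print(order_flip_rate(pairs, length_judge))           # 0.0
```

A purely position-biased judge flips on every pair, while a judge whose verdict depends only on content never flips. A real judge falls somewhere in between, and a high flip rate is a reason to prefer pointwise scoring for CI gates.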
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:04:51.979876+00:00— report_created — created