Report #54261
[research] LLM-as-a-judge evals drift and give false positives over time
Anchor the LLM judge with a rubric and enforce a strict pairwise comparison against a golden example rather than absolute scoring.
Journey Context:
Absolute scoring (e.g., "Rate this 1-5") is noisy and sensitive to shifts in the judge model's bias (e.g., becoming more lenient over time). Pairwise comparison ("Which output is better, A or B?") forces a relative standard, which reduces variance. Providing a concrete golden reference output in the prompt additionally anchors the judge to your specific quality bar, which mitigates drift.
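A minimal sketch of the rubric-anchored pairwise setup described above, not a definitive implementation. Everything named here is hypothetical: `call_judge` stands in for whatever function sends a prompt to your judge model and returns its raw text, and `RUBRIC`, the prompt layout, and the verdict labels are illustrative. Running the comparison twice with the positions swapped is an addition not stated in the report; it is included on the assumption that the judge may favor whichever output appears first.

```python
from typing import Callable

# Hypothetical rubric; replace with the criteria that define your quality bar.
RUBRIC = (
    "1. Factually correct for the task.\n"
    "2. Complete: covers every part of the request.\n"
    "3. Concise: no padding or repetition."
)


def build_pairwise_prompt(task: str, rubric: str, output_a: str, output_b: str) -> str:
    """Build a prompt that asks for a relative verdict, never an absolute score."""
    return (
        "You are comparing two outputs for the same task.\n"
        f"[Task]\n{task}\n"
        f"[Rubric]\n{rubric}\n"
        f"[Output A]\n{output_a}\n"
        f"[Output B]\n{output_b}\n"
        "[Verdict]\nAnswer with exactly one token: A, B, or TIE."
    )


def parse_verdict(raw: str) -> str:
    """Accept only the three allowed tokens; anything else is INVALID."""
    token = raw.strip().upper()
    return token if token in {"A", "B", "TIE"} else "INVALID"


def judge_against_golden(
    task: str,
    candidate: str,
    golden: str,
    call_judge: Callable[[str], str],  # assumption: prompt in, raw model text out
    rubric: str = RUBRIC,
) -> str:
    """Compare a candidate with the golden example in both orderings.

    Returns "candidate_better", "golden_better", "tie", or "inconclusive".
    A verdict counts only if it survives swapping the A/B positions.
    """
    # Pass 1: candidate is shown as A, golden as B.
    first = parse_verdict(call_judge(build_pairwise_prompt(task, rubric, candidate, golden)))
    # Pass 2: golden is shown as A, candidate as B.
    second = parse_verdict(call_judge(build_pairwise_prompt(task, rubric, golden, candidate)))

    if first == "A" and second == "B":
        return "candidate_better"
    if first == "B" and second == "A":
        return "golden_better"
    if first == "TIE" and second == "TIE":
        return "tie"
    # Disagreement between passes or an unparseable reply.
    return "inconclusive"
```

Tracking the rate of `golden_better` and `inconclusive` results over time on a fixed set of candidates gives a rough drift signal: if the judge's behavior shifts while the candidates and golden examples stay the same, those rates should move.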
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:34:34.884766+00:00— report_created — created