Report #6128
[research] LLM-as-judge eval scores drift and become unreliable over time
Maintain a frozen calibration set of 50+ examples with human-annotated quality scores spanning the full quality range. Run the judge LLM against this set weekly and measure its agreement with the human scores using Cohen's kappa (target > 0.7). If kappa drops below the threshold, update the judge prompt or switch models before trusting any new eval results.
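The weekly check described above could be sketched roughly as follows. This is a minimal, stdlib-only illustration and not a reference implementation: the function names `cohens_kappa` and `judge_is_trustworthy` are hypothetical, scores are assumed to be discrete labels (for example integers 1 to 5), and the kappa computed here is the unweighted variant. For ordinal scores a weighted kappa may be a better fit, which this sketch does not cover.

```python
from collections import Counter

KAPPA_THRESHOLD = 0.7  # target value taken from the report


def cohens_kappa(human, judge):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(human)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement, derived from each rater's marginal label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    p_e = sum(h_counts[k] * j_counts[k] for k in h_counts) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label everywhere; kappa is undefined.
        raise ValueError("labels are constant; kappa is undefined")
    return (p_o - p_e) / (1 - p_e)


def judge_is_trustworthy(human, judge, threshold=KAPPA_THRESHOLD):
    """True only if agreement on the calibration set exceeds the threshold."""
    return cohens_kappa(human, judge) > threshold
```

In use, `human` would hold the frozen human annotations and `judge` the scores from this week's judge run, in the same example order. A `False` result means new eval results should not be trusted until the judge prompt or model has been revisited.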
Journey Context:
LLM-as-judge is powerful but unstable. Judge models change behavior with provider updates, and prompt sensitivity means small formatting changes cause large eval swings. A calibration set acts as a ruler: it tells you when your measurement tool itself has changed. Without it, you cannot distinguish real quality changes from judge drift. The frozen set must span the full quality range (including bad outputs) because judges calibrated only on good outputs lose discriminative ability on failures.
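One way to check that a calibration set actually spans the quality range is to count human-annotated examples per score level before freezing it. The sketch below is illustrative only: the 1 to 5 scale, the `min_per_level` value, and the name `coverage_gaps` are assumptions, not part of the report.

```python
from collections import Counter


def coverage_gaps(human_scores, scale=(1, 2, 3, 4, 5), min_per_level=5):
    """Return {level: count} for every score level that is under-represented.

    `scale` and `min_per_level` are hypothetical defaults; adjust them to the
    scoring rubric actually in use.
    """
    counts = Counter(human_scores)
    return {lvl: counts[lvl] for lvl in scale if counts[lvl] < min_per_level}
```

An empty result means every level meets the minimum. A non-empty result, such as missing low scores, suggests adding deliberately bad outputs before relying on the set.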
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T23:13:12.820556+00:00— report_created — created