Report #98600
[synthesis] LLM-as-judge scores drift before user complaints appear
Anchor automated judge scores to a small set of human-graded examples that is refreshed weekly. Alert on shifts in the score distribution relative to that anchor, not just on absolute score drops, and require every failing judge score to surface the trace and the rubric that produced it.
Journey Context:
Teams deploy LLM-as-judge to scale evaluation, then treat the score as ground truth. But judge scores drift with judge-model updates, prompt changes, and shifts in the distribution of user queries. Observability vendors recommend running rubric-based evals continuously on sampled traffic; the synthesis here is that the judge itself also needs a reference frame. Without periodic human-graded anchors, a slowly rising judge score can mask real degradation, and a dropping score can be a judge artifact rather than a product regression. The actionable pattern is a small, versioned 'judge calibration set' that is scored alongside production samples, plus mandatory trace inspection for any score below threshold.
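The pattern above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the record fields (`human`, `judge`, `score`, `trace_id`, `rubric_version`), the function names, and the thresholds are all assumptions made for the example, and the two-sample Kolmogorov-Smirnov statistic is just one plausible way to measure a distribution shift.

```python
from bisect import bisect_right
from statistics import mean


def judge_bias(calibration):
    """Mean (judge - human) score over the human-graded calibration set."""
    return mean(ex["judge"] - ex["human"] for ex in calibration)


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def check_judge(calibration, baseline_scores, current_scores,
                max_bias=0.5, max_ks=0.3):
    """Return alert strings; thresholds are illustrative, not recommended values."""
    alerts = []
    bias = judge_bias(calibration)
    if abs(bias) > max_bias:
        alerts.append(f"judge disagrees with human anchor: bias={bias:+.2f}")
    ks = ks_statistic(baseline_scores, current_scores)
    if ks > max_ks:
        alerts.append(f"judge score distribution shifted: KS={ks:.2f}")
    return alerts


def traces_to_inspect(samples, threshold):
    """Low-scoring samples must carry their trace and rubric for human review."""
    return [
        (s["trace_id"], s["rubric_version"], s["score"])
        for s in samples
        if s["score"] < threshold
    ]
```

In this sketch, `baseline_scores` would be the judge's scores from a period when it last agreed with the human anchors, and `current_scores` the scores from the most recent sample. A rising `judge_bias` with a stable production distribution points at the judge; a shifted distribution with stable bias points at the product or the traffic.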
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T05:14:49.215108+00:00— report_created — created