Report #99089
[synthesis] LLM-as-judge metrics drift while the agent's behavior stays the same
Version judge prompts and rubrics independently, calibrate judges against human ratings on a held-out sample, and track inter-judge agreement as a first-class health metric.
Journey Context:
Teams often deploy a single judge model and treat its score as ground truth, but a judge can remain reliable on validation benchmarks while drifting systematically on the target domain. Research documents criteria drift, length bias, and rubric sensitivity as distinct failure modes. Because a judge can reweight criteria coherently, this drift does not show up as noise. As a result, production quality dashboards can trend downward while the agent itself is unchanged. The synthesis is that the evaluation layer needs the same drift detection as the production layer (separate versioning, calibration, and agreement checks) rather than blind trust in one judge.
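The three checks named above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `JudgeConfig`, `fingerprint`, `cohen_kappa`, and `mean_signed_error` are hypothetical names introduced here, the hash-based version ids are one possible scheme, and Cohen's kappa is used as one common choice of agreement statistic for categorical labels.

```python
import hashlib
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


def fingerprint(text: str) -> str:
    """Short content hash used as a version id (hypothetical scheme)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class JudgeConfig:
    """Hypothetical record of what produced a judge score."""
    model: str
    prompt: str
    rubric: str

    @property
    def version(self) -> Tuple[str, str]:
        # Prompt and rubric are hashed separately, so a change to either
        # one shows up as its own version bump.
        return fingerprint(self.prompt), fingerprint(self.rubric)


def cohen_kappa(a: List[int], b: List[int]) -> float:
    """Chance-corrected agreement between two raters on categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def mean_signed_error(judge: List[int], human: List[int]) -> float:
    """Positive means the judge scores higher than humans on average."""
    if len(judge) != len(human) or not human:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(j - h for j, h in zip(judge, human)) / len(human)


if __name__ == "__main__":
    # Illustrative pass/fail labels on a held-out sample (made-up data).
    human = [1, 1, 0, 0, 1, 0, 1, 1]
    judge_a = [1, 1, 0, 0, 1, 0, 1, 0]
    judge_b = [1, 0, 0, 0, 1, 0, 0, 0]
    print("judge A vs human kappa:", cohen_kappa(human, judge_a))
    print("judge A vs judge B kappa:", cohen_kappa(judge_a, judge_b))
    print("judge A bias vs human:", mean_signed_error(judge_a, human))
```

Logging `JudgeConfig.version` next to every score makes it possible to tell a dashboard shift caused by a prompt or rubric change from one caused by the agent. Recomputing kappa and the signed error on the same held-out sample at a fixed cadence gives a time series for the judge itself. Any alert threshold on those values is a team-specific choice that this sketch does not prescribe.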
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:17:27.584069+00:00 — report_created — created