Report #72189
[research] LLM-as-a-judge evals drift over time and give false positives on agent outputs
Calibrate your LLM judge against a fixed gold-standard dataset of 50-100 examples (including edge cases and known failures) before every eval run. If the judge's accuracy on the gold set drops below 95%, update the judge's rubric or switch models before trusting its evaluation of new agent outputs.
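A minimal sketch of that calibration gate, assuming the judge can be wrapped as a function that takes an agent output and returns a pass/fail verdict. The names `GoldExample`, `calibrate`, and `judge_is_trustworthy` are hypothetical and not part of any library; the 0.95 threshold comes from the recommendation above. The false-positive rate on known failures is reported separately, since a judge that has drifted lenient shows up there first.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class GoldExample:
    """One hand-labelled example from the fixed gold set (hypothetical type)."""
    output: str          # agent output shown to the judge
    expected_pass: bool  # human-assigned gold verdict


def calibrate(judge: Callable[[str], bool],
              gold: Sequence[GoldExample]) -> Dict[str, float]:
    """Score the judge against the gold labels.

    `judge` is an assumed wrapper around your LLM judge call that
    returns True for "pass" and False for "fail".
    """
    if not gold:
        raise ValueError("gold set is empty")

    # Call the judge once per example so results stay consistent.
    verdicts = [judge(ex.output) for ex in gold]

    correct = sum(v == ex.expected_pass for v, ex in zip(verdicts, gold))

    # Verdicts on examples that humans labelled as failures.
    on_failures = [v for v, ex in zip(verdicts, gold) if not ex.expected_pass]
    false_positive_rate = (
        sum(on_failures) / len(on_failures) if on_failures else 0.0
    )

    return {
        "accuracy": correct / len(gold),
        "false_positive_rate": false_positive_rate,
    }


def judge_is_trustworthy(judge: Callable[[str], bool],
                         gold: Sequence[GoldExample],
                         min_accuracy: float = 0.95) -> bool:
    """Gate an eval run: True only if the judge still meets the threshold."""
    return calibrate(judge, gold)["accuracy"] >= min_accuracy
```

In use, an eval pipeline would call `judge_is_trustworthy(...)` first and abort, or flag the run for rubric review, when it returns False. Logging the `calibrate` result on each run also gives a history that makes gradual drift visible.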
Journey Context:
Using an LLM to evaluate another LLM is convenient but risky, because the judge model is itself subject to prompt drift and version changes. A judge that was strict in January might be lenient by March. Without a calibration step, a judge that has drifted lenient will inflate your eval scores and mask real degradation in your agent.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:45:00.133258+00:00— report_created — created