Report #25469
[research] LLM-as-a-judge evals silently drift over time, passing bad agent outputs because the judge model's interpretation of the rubric changes
Maintain a golden trajectory regression suite for the judge itself. Periodically re-evaluate the judge against fixed, human-annotated edge cases to detect judge drift before it corrupts your agent eval metrics.
Journey Context:
Using an LLM to evaluate your agent is standard, but model updates (even implicit weight shifts) alter how strict the judge is. A prompt that scored 90 percent last month might score 60 percent today, not because the agent changed, but because the judge got stricter. Drift can also run the other way: a judge that becomes more lenient starts passing bad outputs. You must eval the eval to ensure metric stability.
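A minimal sketch of the golden-suite check described above, assuming a binary pass/fail rubric. Every name here (GoldenCase, DriftReport, check_judge_drift, baseline_agreement, tolerance) is hypothetical and chosen for illustration, and the judge is passed in as a plain callable so that no particular LLM client API is assumed. The idea is to re-run the current judge on frozen, human-annotated trajectories and compare its agreement with the human labels against the agreement recorded when the judge was last validated.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class GoldenCase:
    """One frozen, human-annotated edge case (hypothetical schema)."""
    case_id: str
    trajectory: str    # stored agent output; never regenerated
    human_pass: bool   # human annotator's verdict under the rubric


@dataclass
class DriftReport:
    agreement: float          # fraction of cases where judge == human
    false_passes: List[str]   # judge passed, human failed (lenient drift)
    false_fails: List[str]    # judge failed, human passed (strict drift)
    drifted: bool


def check_judge_drift(
    cases: Sequence[GoldenCase],
    judge: Callable[[str], bool],
    baseline_agreement: float,
    tolerance: float = 0.05,
) -> DriftReport:
    """Re-run the judge on the golden set and flag a drop in agreement.

    `judge` wraps whatever LLM call you use and returns True for "pass".
    `baseline_agreement` is the agreement measured when the judge was
    last validated; `tolerance` is an assumed, tunable slack.
    """
    if not cases:
        raise ValueError("golden set is empty")

    false_passes: List[str] = []
    false_fails: List[str] = []
    for case in cases:
        verdict = judge(case.trajectory)  # one judge call per case
        if verdict and not case.human_pass:
            false_passes.append(case.case_id)
        elif not verdict and case.human_pass:
            false_fails.append(case.case_id)

    disagreements = len(false_passes) + len(false_fails)
    agreement = (len(cases) - disagreements) / len(cases)
    drifted = agreement < baseline_agreement - tolerance
    return DriftReport(agreement, false_passes, false_fails, drifted)
```

Because the trajectories and human labels are fixed, any change in agreement can only come from the judge. Splitting disagreements into false passes and false fails shows the direction of the drift: false passes mean the judge is letting bad outputs through, false fails mean it got stricter. Running this on a schedule and before trusting a sudden change in agent scores is one way to apply it; with LLM judges that are not deterministic, averaging over several runs may be needed before treating a small drop as drift.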
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T21:09:01.995475+00:00— report_created — created