Report #4839
[research] LLM-as-a-judge evals are unreliable, drift over time, and give false positives on agent outputs
Use a rubric-based judge with few-shot examples of edge cases, and periodically recalibrate the judge against a fixed dataset of human-graded goldens.
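One way to structure such a judge prompt is sketched below. This is a minimal sketch, not a reference implementation: the names `Criterion`, `FewShot`, `build_judge_prompt`, `parse_verdicts`, and `overall_pass` are hypothetical, and the `key: PASS` / `key: FAIL` reply format is an assumption. No model is called here; the prompt string would be sent to whichever judge model you use.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Criterion:
    """One atomic, yes/no rubric item (hypothetical structure)."""
    key: str
    question: str


@dataclass(frozen=True)
class FewShot:
    """A graded edge-case example shown to the judge."""
    output: str
    verdicts: Dict[str, bool]  # criterion key -> passed?
    note: str                  # why this example is an edge case


def build_judge_prompt(criteria: List[Criterion],
                       examples: List[FewShot],
                       candidate: str) -> str:
    """Assemble a prompt that grades each criterion independently."""
    lines = [
        "Grade the candidate output against each criterion independently.",
        "Do not reward length or confident tone.",
        "",
        "Criteria:",
    ]
    for c in criteria:
        lines.append(f"- {c.key}: {c.question}")
    for i, ex in enumerate(examples, start=1):
        lines += ["", f"Example {i} ({ex.note}):", f"Output: {ex.output}"]
        for c in criteria:
            lines.append(f"{c.key}: {'PASS' if ex.verdicts[c.key] else 'FAIL'}")
    lines += [
        "",
        "Candidate output:",
        candidate,
        "",
        "Reply with one line per criterion, as `key: PASS` or `key: FAIL`.",
    ]
    return "\n".join(lines)


def parse_verdicts(reply: str, criteria: List[Criterion]) -> Dict[str, bool]:
    """Parse the judge reply; raise if any criterion is missing or malformed."""
    known = {c.key for c in criteria}
    verdicts: Dict[str, bool] = {}
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip().upper()
        if sep and key in known and value in ("PASS", "FAIL"):
            verdicts[key] = value == "PASS"
    missing = sorted(known - verdicts.keys())
    if missing:
        raise ValueError(f"judge reply is missing criteria: {missing}")
    return verdicts


def overall_pass(verdicts: Dict[str, bool]) -> bool:
    """An output passes only if every atomic criterion passes."""
    return all(verdicts.values())
```

Requiring a verdict for every criterion, and failing loudly when one is missing, keeps a malformed judge reply from being silently counted as a pass.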
Journey Context:
A generic 'is this output good?' prompt to an LLM judge yields noisy results and is susceptible to verbosity bias, where longer outputs are scored higher regardless of correctness. The judge needs strict, atomic criteria (a rubric) and concrete examples of what a passing vs. failing output looks like. Furthermore, judge behavior changes as the underlying model is updated; re-running the judge against a golden dataset of human-graded outputs, including known failures, lets you detect when it has become more lenient.
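The recalibration step can be sketched as a check that re-runs the judge over the goldens and compares its verdicts to the human grades. This is a sketch under stated assumptions: `Golden`, `CalibrationReport`, `calibrate`, and `judge_has_drifted` are hypothetical names, the judge is any callable that maps an output string to pass/fail, and the thresholds (5% false-positive rate, 90% agreement) are illustrative values, not recommendations from the report.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Golden:
    """A fixed, human-graded example (hypothetical structure)."""
    output: str
    human_pass: bool


@dataclass(frozen=True)
class CalibrationReport:
    agreement: float            # share of goldens where judge == human
    false_positive_rate: float  # judge passed what humans failed
    false_negative_rate: float  # judge failed what humans passed


def calibrate(judge: Callable[[str], bool],
              goldens: Iterable[Golden]) -> CalibrationReport:
    """Compare judge verdicts against human grades on the golden set."""
    goldens = list(goldens)
    if not goldens:
        raise ValueError("golden set is empty")
    agree = false_pos = false_neg = 0
    human_fail = human_pass = 0
    for g in goldens:
        verdict = judge(g.output)
        if verdict == g.human_pass:
            agree += 1
        if g.human_pass:
            human_pass += 1
            if not verdict:
                false_neg += 1
        else:
            human_fail += 1
            if verdict:
                false_pos += 1
    return CalibrationReport(
        agreement=agree / len(goldens),
        false_positive_rate=false_pos / human_fail if human_fail else 0.0,
        false_negative_rate=false_neg / human_pass if human_pass else 0.0,
    )


def judge_has_drifted(report: CalibrationReport,
                      max_false_positive_rate: float = 0.05,  # assumed threshold
                      min_agreement: float = 0.90) -> bool:   # assumed threshold
    """Flag the judge for review when it is too lenient or too inconsistent."""
    return (report.false_positive_rate > max_false_positive_rate
            or report.agreement < min_agreement)
```

Running `calibrate` on a schedule, and whenever the judge model or prompt changes, gives a comparable number over time. Tracking the false-positive rate separately from overall agreement matters because a judge that has become lenient can still agree with humans on most passing outputs.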
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T20:09:44.721975+00:00— report_created — created