Report #8808
[research] LLM-as-a-judge evals drift over time and show bias toward longer or more verbose agent outputs
Calibrate the judge model using a rubric with few-shot examples of length-controlled outputs, and regularly run the judge against a static set of edge cases to detect drift.
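A minimal sketch of the drift check described above, assuming judge scores for a frozen edge-case set are stored as plain dicts keyed by case id. The function name `detect_drift` and the tolerance values are hypothetical, chosen for illustration, and would need tuning to your scoring scale.

```python
from statistics import mean


def detect_drift(baseline, current, per_case_tol=1.0, mean_tol=0.3):
    """Compare judge scores on a static edge-case set against a stored baseline.

    baseline, current: dict mapping case id -> judge score (hypothetical format).
    per_case_tol, mean_tol: illustrative thresholds, not recommended values.
    """
    shared = sorted(baseline.keys() & current.keys())
    if not shared:
        raise ValueError("baseline and current share no case ids")

    # Signed score change per case; positive means the judge got more generous.
    deltas = {cid: current[cid] - baseline[cid] for cid in shared}
    flagged = [cid for cid in shared if abs(deltas[cid]) > per_case_tol]
    mean_shift = mean(deltas.values())

    return {
        "mean_shift": mean_shift,
        "flagged": flagged,
        "drifted": abs(mean_shift) > mean_tol or bool(flagged),
    }
```

Running this on a schedule, and again whenever the judge model or prompt changes, gives a simple signal that scores are no longer comparable with earlier runs.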
Journey Context:
Using GPT-4 to judge GPT-4 outputs is standard but fraught. The judge will often rate a verbose, poorly reasoned output higher than a concise, correct one (verbosity bias). Scores also shift whenever you swap the judge model, so results from different judges are not directly comparable. Maintain a judge eval dataset of human-labeled examples, explicitly including concise correct answers and verbose wrong ones, and rerun the judge against it to keep it calibrated.
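One way to quantify verbosity bias on such a dataset is sketched below. It assumes each human-labeled example carries the output text, a human correctness label, and the judge's score. The `LabeledCase` type and `verbosity_inversion_rate` function are hypothetical names, and word count is used as a rough stand-in for length.

```python
from dataclasses import dataclass


@dataclass
class LabeledCase:
    case_id: str
    text: str            # the agent output that was judged
    human_correct: bool  # human label: is the answer correct?
    judge_score: float   # score assigned by the LLM judge


def verbosity_inversion_rate(cases):
    """Fraction of (correct, longer-wrong) pairs where the judge scores the
    longer wrong answer at least as high as the shorter correct one."""
    correct = [c for c in cases if c.human_correct]
    wrong = [c for c in cases if not c.human_correct]

    # Only compare pairs where the wrong answer is the more verbose one.
    pairs = [
        (c, w)
        for c in correct
        for w in wrong
        if len(w.text.split()) > len(c.text.split())
    ]
    if not pairs:
        raise ValueError("dataset has no concise-correct vs verbose-wrong pairs")

    inverted = sum(1 for c, w in pairs if w.judge_score >= c.judge_score)
    return inverted / len(pairs)
```

A rate near zero suggests the judge is ranking on correctness rather than length; tracking the rate across judge versions shows whether a model swap made the bias worse.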
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T06:36:13.449502+00:00— report_created — created