Report #72457
[research] LLM-as-a-judge evals are biased toward verbosity and agreeableness, giving false positives
Calibrate the judge by presenting the candidate outputs in both orders to detect position bias, and enforce a strict rubric anchored to a reference answer. Require the judge to write out its chain-of-thought reasoning in the response before it outputs the score.
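As an illustration of the rubric, reference answer, and reasoning-before-score requirements, here is a minimal Python sketch. The template wording, the 1 to 5 scale, and the helper names `build_judge_prompt` and `parse_score` are assumptions made for this example, not part of any particular eval framework.

```python
import re

# Hypothetical judge prompt: rubric and reference answer come first,
# and the score is requested only after the written reasoning.
JUDGE_TEMPLATE = """You are grading a candidate answer against a reference answer.

Rubric:
{rubric}

Question:
{question}

Reference answer:
{reference}

Candidate answer:
{candidate}

First write your reasoning under the heading "Reasoning:", checking each rubric item.
Do not reward length or politeness on their own.
Then, on the final line, write "Score: <integer from 1 to 5>"."""


def build_judge_prompt(question, reference, candidate, rubric):
    """Fill the template; the judge sees the reference before the candidate."""
    return JUDGE_TEMPLATE.format(
        rubric=rubric, question=question, reference=reference, candidate=candidate
    )


def parse_score(judge_output, lo=1, hi=5):
    """Return the last 'Score: N' line as an int, or None if missing or out of range."""
    matches = re.findall(r"^Score:\s*(\d+)\s*$", judge_output, flags=re.MULTILINE)
    if not matches:
        return None
    score = int(matches[-1])
    return score if lo <= score <= hi else None
```

Returning `None` for a missing or out-of-range score lets the eval suite count malformed judge responses separately instead of silently treating them as passes.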
Journey Context:
Using an LLM to evaluate an agent is standard, but naive implementations suffer from position bias (preferring the first output) and verbosity bias (preferring longer outputs). Forcing the judge to articulate its reasoning (CoT) before scoring, and randomizing the presentation order during regression testing, helps reduce variance and bias in the eval suite.
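The order handling described above can be sketched in Python as follows. This assumes a hypothetical `judge(first, second)` callable that wraps the model call and returns the string "first" or "second"; the function names and the choice to treat disagreement as "no verdict" are illustrative assumptions.

```python
import random


def judge_pair_both_orders(judge, output_a, output_b):
    """Calibration: ask twice with the order swapped.

    Returns "a" or "b" when both passes agree, otherwise None
    (the verdict depended on position, or the judge output was malformed).
    """
    pass_1 = judge(output_a, output_b)  # a shown first
    pass_2 = judge(output_b, output_a)  # b shown first
    if pass_1 not in ("first", "second") or pass_2 not in ("first", "second"):
        return None
    winner_1 = "a" if pass_1 == "first" else "b"
    winner_2 = "b" if pass_2 == "first" else "a"
    return winner_1 if winner_1 == winner_2 else None


def judge_pair_randomized(judge, output_a, output_b, rng):
    """Regression runs: one call per pair, with the order chosen at random."""
    swapped = rng.random() < 0.5
    first, second = (output_b, output_a) if swapped else (output_a, output_b)
    verdict = judge(first, second)
    if verdict not in ("first", "second"):
        return None
    picked_first = verdict == "first"
    # Map the positional verdict back to the original labels.
    return "b" if picked_first == swapped else "a"
```

Passing a seeded `random.Random(seed)` as `rng` keeps regression runs reproducible. The share of pairs for which `judge_pair_both_orders` returns `None` gives a rough measure of how position-sensitive the judge is.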
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T04:12:43.231322+00:00— report_created — created