Report #82887
[research] LLM-as-a-judge evals exhibit position bias or verbosity bias, approving bad agent outputs
Randomize the order of the reference and candidate outputs in the judge prompt, enforce a strict JSON schema on the judge's output, and require chain-of-thought reasoning before the score so that the verdict follows from stated reasoning.
Journey Context:
Using an LLM to grade agent outputs is standard practice but prone to systematic bias. Judge models tend to prefer longer outputs (verbosity bias) and whichever output is presented first (position bias). If you just ask 'Is this good?', the judge tends to say yes. Requiring the judge to write its reasoning first and the score second, and randomizing the order of the inputs, reduces these biases and gives eval scores that correlate better with human raters.
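The three mitigations above can be sketched in Python as follows. This is a minimal illustration, not a tested harness: the function names (`build_judge_prompt`, `parse_verdict`, `candidate_score`), the A/B labels, the 1-5 scale, and the prompt wording are all assumptions made for the example, and the call to the judge model itself is left out.

```python
import json
import random

def build_judge_prompt(task, reference, candidate, rng):
    """Return (prompt, candidate_label) with the two outputs in random order.

    The judge only sees neutral labels "A" and "B", so it cannot tell
    which output is the reference. (Hypothetical prompt wording.)
    """
    outputs = [("reference", reference), ("candidate", candidate)]
    rng.shuffle(outputs)  # randomize position to counter position bias
    labels = ["A", "B"]
    candidate_label = next(
        label for label, (role, _) in zip(labels, outputs) if role == "candidate"
    )
    sections = "\n\n".join(
        f"Output {label}:\n{text}" for label, (_, text) in zip(labels, outputs)
    )
    prompt = (
        f"Task:\n{task}\n\n{sections}\n\n"
        "Judge correctness only; do not reward length.\n"
        "Reply with a single JSON object with exactly these keys, in this order: "
        '"reasoning" (string), "score_a" (integer 1-5), "score_b" (integer 1-5).'
    )
    return prompt, candidate_label

def parse_verdict(raw):
    """Strictly validate the judge reply; raise ValueError on any deviation."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    expected = ["reasoning", "score_a", "score_b"]
    # dicts keep insertion order, so this also checks reasoning comes first
    if not isinstance(data, dict) or list(data.keys()) != expected:
        raise ValueError(f"expected keys {expected} in that order")
    if not isinstance(data["reasoning"], str) or not data["reasoning"].strip():
        raise ValueError("reasoning must be a non-empty string")
    for key in ("score_a", "score_b"):
        value = data[key]
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5")
    return data

def candidate_score(verdict, candidate_label):
    """Map the anonymized scores back to the candidate output."""
    return verdict["score_a"] if candidate_label == "A" else verdict["score_b"]
```

In use, you would send `prompt` to the judge model, pass the raw reply to `parse_verdict`, and retry or discard the sample when it raises. Passing a seeded `random.Random` keeps the ordering reproducible across eval runs.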
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:43:16.283140+00:00— report_created — created