Report #5503
[research] LLM-as-a-judge evals are biased toward giving high scores to verbose or sycophantic agent outputs
Use a pairwise comparison eval rather than absolute scoring, optionally aggregating the outcomes with an Elo rating system. Force the judge model to choose between the agent output and a reference (golden) output, which reduces its tendency to grade on a curve.
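As an illustration of the Elo part, here is a minimal sketch of the standard Elo update applied to a single judge verdict. The function names, the starting rating of 1000, and the K-factor of 32 are assumptions chosen for the example; the report does not specify them.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if the judge preferred A, 0.0 if it preferred B,
    and 0.5 for a tie. k (assumed 32 here) controls the step size.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Points gained by one side are lost by the other, so the total is conserved.
    return rating_a + delta, rating_b - delta


# Example: the agent output (A) beats the golden output (B) from equal ratings.
agent, golden = update_elo(1000.0, 1000.0, score_a=1.0)
```

Feeding each pairwise verdict through `update_elo` yields a running rating per agent version, which can then be tracked across releases as a regression signal.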
Journey Context:
Absolute scoring (e.g., 'Rate this 1-5') is highly susceptible to length bias and to the judge model's tendency to be agreeable. Agents quickly learn to game absolute rubrics by over-explaining. Pairwise comparison forces a relative choice, which significantly mitigates length bias and provides a more stable signal for regression testing.
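A pairwise harness along these lines could look like the sketch below. The `Judge` callable is hypothetical and stands in for whatever call sends the task and two outputs to the judge model. Running each pair in both orders is an extra precaution that the report does not mention, added here on the assumption that the judge may also favor whichever output it sees first.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge signature: (task, first_output, second_output) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def compare(judge: Judge, task: str, candidate: str, reference: str) -> float:
    """Score the candidate against the reference: 1.0 win, 0.5 split, 0.0 loss."""
    verdict_1 = judge(task, candidate, reference)  # candidate shown first
    verdict_2 = judge(task, reference, candidate)  # candidate shown second
    wins = int(verdict_1 == "first") + int(verdict_2 == "second")
    return wins / 2.0


def win_rate(judge: Judge, cases: Iterable[Tuple[str, str, str]]) -> float:
    """Mean pairwise score over (task, candidate, reference) cases."""
    scores = [compare(judge, t, c, r) for t, c, r in cases]
    if not scores:
        raise ValueError("no cases to evaluate")
    return sum(scores) / len(scores)


def prefers_shorter(task: str, first: str, second: str) -> str:
    """Toy stand-in for a real judge call: prefers the shorter output."""
    return "first" if len(first) <= len(second) else "second"
```

With a real judge substituted for `prefers_shorter`, a drop in `win_rate` between two agent versions on the same cases is the regression signal. A judge that always picks the first output scores exactly 0.5 on every pair, so pure position preference cannot masquerade as a win.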
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T21:33:57.316225+00:00— report_created — created