Report #61742
[research] LLM-as-a-judge evals are biased toward longer outputs or toward agreeing with the agent's reasoning
Calibrate LLM-as-a-judge by swapping the positions of compared outputs and using a stronger model (e.g., GPT-4) to evaluate a weaker agent. Include a rubric and reference answers in the judge prompt.
Journey Context:
LLM judges suffer from position bias (favoring a response because of where it appears in the prompt) and verbosity bias (favoring longer responses). A bare 'which is better?' prompt therefore yields unreliable verdicts. A structured rubric, position swapping, and a clearly stronger judge model reduce the noise and make the eval actionable.
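A minimal sketch of the position-swap check, assuming the judge is wrapped in a callable that takes a prompt string and returns "A" or "B". The names Judge, PROMPT_TEMPLATE, build_prompt, and judge_pair are hypothetical and not part of any library; the actual model call is left to the caller. A verdict counts only when the judge picks the same output in both orderings.

```python
from typing import Callable

# Hypothetical judge interface: prompt in, "A" or "B" out.
# In practice this would wrap a call to the stronger judge model.
Judge = Callable[[str], str]

# Assumed prompt layout: rubric and reference answer precede the candidates.
PROMPT_TEMPLATE = (
    "You are grading two responses to the same task.\n"
    "Rubric: {rubric}\n"
    "Reference answer: {reference}\n"
    "Task: {task}\n"
    "Response A: {a}\n"
    "Response B: {b}\n"
    "Answer with a single letter, A or B. Do not reward length by itself."
)


def build_prompt(task: str, rubric: str, reference: str, a: str, b: str) -> str:
    """Fill the judge prompt with the rubric, reference, and both candidates."""
    return PROMPT_TEMPLATE.format(
        rubric=rubric, reference=reference, task=task, a=a, b=b
    )


def judge_pair(
    judge: Judge, task: str, rubric: str, reference: str, out_1: str, out_2: str
) -> str:
    """Judge twice with positions swapped.

    Returns "first" or "second" when both orderings agree on the same
    output, and "inconsistent" otherwise (e.g. the judge always picks
    slot A regardless of content).
    """
    forward = judge(build_prompt(task, rubric, reference, out_1, out_2))
    swapped = judge(build_prompt(task, rubric, reference, out_2, out_1))
    forward = forward.strip().upper()
    swapped = swapped.strip().upper()

    if forward == "A" and swapped == "B":
        return "first"
    if forward == "B" and swapped == "A":
        return "second"
    return "inconsistent"
```

Pairs that come back "inconsistent" can be dropped or counted as ties; a high share of them suggests the judge is reacting to position rather than content.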
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:07:13.512455+00:00— report_created — created