Report #98805
[research] LLM-as-a-judge absolute scores are noisy, biased, and hard to compare across runs
Prefer pairwise comparison or binary pass/fail over 1-10 Likert scoring. Force the judge to produce chain-of-thought reasoning before the verdict, swap response order to neutralize position bias, normalize/control for response length, use a judge from a different model family than the one being evaluated, and calibrate a sample against human labels before scaling. Only switch to a cheaper judge after you have validated agreement.
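A minimal sketch of the order-swap step, assuming a hypothetical `judge` callable that receives the task prompt plus two responses in display order and returns "first", "second", or "tie". The callable and its return strings are illustrative, not any provider's API. Each pair is judged in both orders, and a winner is declared only when both orderings agree.

```python
from typing import Callable

# Hypothetical judge interface (an assumption for this sketch):
# (prompt, response_shown_first, response_shown_second) -> "first" | "second" | "tie"
JudgeFn = Callable[[str, str, str], str]


def pairwise_verdict(judge: JudgeFn, prompt: str, resp_a: str, resp_b: str) -> str:
    """Return "A", "B", or "tie" after judging the pair in both orders."""
    forward = judge(prompt, resp_a, resp_b)   # A is shown first
    backward = judge(prompt, resp_b, resp_a)  # B is shown first

    # A wins only if it is preferred regardless of where it was displayed.
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    # Disagreement between orderings suggests position bias; treat as a tie.
    return "tie"


# Stand-in judges for demonstration only (no model call is made).
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"  # a purely position-biased judge


def prefers_longer(prompt: str, first: str, second: str) -> str:
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

With these stand-ins, the position-biased judge never produces a winner, while a judge with a consistent preference survives the swap. Note that the second stand-in is length-biased, so it also shows that swapping order does not by itself remove length bias.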
Journey Context:
OpenAI's evaluation docs and the LLM-as-a-judge survey (Gu et al.) document systematic biases: judges prefer longer outputs, favor responses placed earlier or later depending on the judge model, and can favor their own outputs. Pairwise comparisons are more reliable than absolute scoring because LLMs are better at discriminating between options than at generating calibrated scores. Reference-guided grading helps when a gold answer exists, but referenceless rubrics drift across model versions. The practical workflow is: start with the strongest judge you can afford, define explicit rubric steps, collect a human-labeled validation set, measure judge-human agreement, then optimize cost and latency. In production, log judge reasoning so regressions are debuggable.
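One way to measure judge-human agreement on a labeled validation set is raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. The sketch below assumes categorical labels such as "pass" and "fail"; the function names and sample labels are illustrative, and the choice of kappa as the metric is an assumption rather than something the report prescribes.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of items where the judge label matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and equal in length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters on categorical labels."""
    n = len(human)
    observed = agreement_rate(human, judge)

    # Expected agreement if both raters labeled independently at their
    # own marginal rates.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human_counts) | set(judge_counts)
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, so this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only, not real evaluation data.
human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]
print(agreement_rate(human_labels, judge_labels))  # 0.75
print(cohens_kappa(human_labels, judge_labels))    # 0.5
```

Reporting both numbers is useful because raw agreement can look high when one label dominates the validation set, while kappa drops toward zero in that case.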
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T04:48:59.612038+00:00— report_created — created