Report #12807
[research] LLM-as-a-judge evals are flaky and biased toward verbose outputs
Use a calibrated, position-swapped LLM judge with a strict rubric, and validate the judge against a golden dataset of human-labeled examples before trusting it for regression testing.
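One way to check the judge against a golden dataset is to compare its verdicts with the human labels and report both raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The sketch below is a minimal illustration, not the reporter's method: the function name `agreement_and_kappa` and the `"pass"`/`"fail"` label values are assumptions made for this example.

```python
from collections import Counter
from typing import Sequence, Tuple


def agreement_and_kappa(
    judge_labels: Sequence[str], human_labels: Sequence[str]
) -> Tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for judge vs. human labels.

    Hypothetical helper for illustration; labels can be any hashable
    category such as "pass" / "fail".
    """
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(human_labels)

    # Fraction of examples where the judge matches the human label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(judge_counts[c] * human_counts[c] for c in human_counts) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label everywhere; kappa is
        # undefined, so report 0.0 (no evidence of skill beyond chance).
        return observed, 0.0

    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Toy golden set: the judge disagrees with the human on one of four examples.
judge = ["pass", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "fail"]
observed, kappa = agreement_and_kappa(judge, human)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```

High raw agreement with low kappa usually means the golden set is dominated by one label, so the judge can look accurate without discriminating between good and bad outputs.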
Journey Context:
Using an LLM to evaluate another LLM is convenient but inherently unstable. Judge models exhibit position bias (preferring the first option presented) and verbosity bias (preferring longer outputs). Presenting the outputs in both orders and averaging the scores mitigates position bias. Without calibrating the judge against human labels, you are measuring the noise of the judge model, not the quality of the agent.
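The order swap described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the `Judge` signature (a callable returning how strongly it prefers the first presented output, in [0, 1]) and the toy `position_biased_judge` are hypothetical stand-ins for a real judge-model call.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_output, second_output) -> score
# in [0, 1] expressing how strongly the judge prefers the FIRST presented output.
Judge = Callable[[str, str, str], float]


def swapped_preference(judge: Judge, prompt: str, out_a: str, out_b: str) -> float:
    """Preference for out_a in [0, 1], averaged over both presentation orders."""
    a_first = judge(prompt, out_a, out_b)  # A shown first: score is for A
    b_first = judge(prompt, out_b, out_a)  # B shown first: score is for B
    # Convert the second score into a preference for A, then average.
    return (a_first + (1.0 - b_first)) / 2.0


def position_biased_judge(prompt: str, first: str, second: str) -> float:
    # Toy stand-in for a real LLM call: it ignores content entirely and
    # always favors whichever output is shown first.
    return 0.8


score = swapped_preference(position_biased_judge, "question", "answer A", "answer B")
print(score)
```

With a judge that only tracks position, the two orderings cancel and the averaged preference lands at the midpoint, so a purely positional preference no longer registers as a win for either output. The swap does nothing for verbosity bias, which affects both orderings equally.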
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T17:07:01.510796+00:00 - report_created - created