Report #18050
[research] LLM-as-a-judge evals are inconsistent and biased toward verbose outputs
Use a multi-point rubric with explicit scoring criteria (e.g., a 0-2 scale with strict definitions), and swap the candidate/reference order in pairwise comparisons to mitigate position bias.
Journey Context:
Generic prompts like 'which output is better?' yield noisy evals. LLM judges suffer from verbosity bias (longer = better) and position bias (first = better). A strict, multi-point rubric forces the judge to evaluate specific constraints instead of forming an overall impression. Swapping the order in pairwise tests both measures and corrects for position bias: a verdict that flips when the order flips reflects position, not quality, and can be treated as a tie. Together these make regression suites reliable enough to block merges.
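A minimal sketch of the order-swap check, assuming a judge is any callable that receives the rubric plus two outputs and answers "first", "second", or "tie". The names `RUBRIC`, `compare_with_swap`, and `flip_rate`, and the three criteria, are hypothetical and only illustrate the idea; the actual judge call to an LLM is left out.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical rubric: every criterion is scored 0, 1, or 2 with a strict definition.
# Scoring concision explicitly is one way to counter verbosity bias.
RUBRIC: Dict[str, str] = {
    "correctness": "0 = factual error; 1 = minor imprecision; 2 = fully correct",
    "constraints": "0 = ignores required format; 1 = partly follows; 2 = follows exactly",
    "concision": "0 = padded or repetitive; 1 = some filler; 2 = no filler",
}

# Assumed judge interface: (rubric, first_output, second_output) -> "first" | "second" | "tie"
Judge = Callable[[Dict[str, str], str, str], str]


def compare_with_swap(judge: Judge, a: str, b: str) -> dict:
    """Judge the pair in both orders and accept a winner only if both runs agree."""
    forward = judge(RUBRIC, a, b)   # a is shown first
    backward = judge(RUBRIC, b, a)  # b is shown first

    # Map positional verdicts back to the labels "a" and "b".
    forward_label = {"first": "a", "second": "b", "tie": "tie"}[forward]
    backward_label = {"first": "b", "second": "a", "tie": "tie"}[backward]

    consistent = forward_label == backward_label
    # A verdict that changes with the order is treated as a tie.
    return {"winner": forward_label if consistent else "tie", "consistent": consistent}


def flip_rate(judge: Judge, pairs: List[Tuple[str, str]]) -> float:
    """Fraction of pairs whose verdict changes when the order is swapped."""
    flips = sum(not compare_with_swap(judge, a, b)["consistent"] for a, b in pairs)
    return flips / len(pairs)
```

Tracking `flip_rate` over a fixed set of pairs gives a rough measure of how much position bias a given judge prompt has, which can help decide whether the suite is stable enough to gate merges.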
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T07:10:58.977362+00:00— report_created — created