Report #99267
[research] LLM-as-a-judge suffers from position, verbosity, and self-preference biases that flip verdicts without changing answer quality
Use pairwise judging with swapped positions and require a consistent winner across both orderings; add a structured rubric with criterion-by-criterion chain-of-thought and form-filling JSON output; assemble a small panel of judges from different model families and treat disagreement as a flag for human review. Calibrate against a human-labeled golden set before trusting absolute scores.
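The swap-and-panel procedure above can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Judge` callable interface, the verdict strings, and the toy judges are all hypothetical stand-ins for real model calls, and unanimity is only one possible rule for escalating to human review.

```python
from typing import Callable, Literal

Verdict = Literal["first", "second", "tie"]
# Hypothetical judge interface: (prompt, first_response, second_response) -> verdict.
# In practice each judge would wrap a call to a model from a different family.
Judge = Callable[[str, str, str], Verdict]


def swap_consistent_winner(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Return "A", "B", or "tie" only if both orderings agree, else "inconsistent"."""
    forward = judge(prompt, a, b)  # A is shown first
    reverse = judge(prompt, b, a)  # B is shown first
    forward_winner = {"first": "A", "second": "B", "tie": "tie"}[forward]
    reverse_winner = {"first": "B", "second": "A", "tie": "tie"}[reverse]
    return forward_winner if forward_winner == reverse_winner else "inconsistent"


def panel_verdict(judges: list[Judge], prompt: str, a: str, b: str) -> dict:
    """Collect swap-consistent votes; anything short of a unanimous A or B is flagged."""
    votes = [swap_consistent_winner(j, prompt, a, b) for j in judges]
    unanimous = len(set(votes)) == 1 and votes[0] in ("A", "B")
    return {
        "votes": votes,
        "winner": votes[0] if unanimous else None,
        "needs_human_review": not unanimous,
    }


# Toy judges for illustration only.
def always_first(prompt: str, first: str, second: str) -> Verdict:
    return "first"  # pure position bias


def prefers_42(prompt: str, first: str, second: str) -> Verdict:
    if ("42" in first) == ("42" in second):
        return "tie"
    return "first" if "42" in first else "second"


if __name__ == "__main__":
    good, bad = "The answer is 42.", "The answer is 41."
    print(swap_consistent_winner(always_first, "q", good, bad))
    print(panel_verdict([prefers_42, always_first], "q", good, bad))
```

A position-biased judge picks a different response in each ordering, so its vote comes back as "inconsistent" and the panel result is flagged instead of being counted as a win.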
Journey Context:
Single-model pointwise scores drift upward for longer, more confidently phrased, or first-presented responses. Position-swap averaging removes order effects; rubrics reduce the judge's tendency to holistically 'vibe' a score; diverse panels break self-preference because models favor outputs with lower perplexity, which correlates with their own family. Cost matters, so the practical stack is: one strong judge for iteration, swap+panel only for final model-selection decisions, and a frozen golden set to detect judge drift over time. Do not run the judge model and the candidate model from the same family.
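One way to use the frozen golden set is to track how often the judge matches the human labels, plus a chance-corrected agreement statistic. The sketch below uses Cohen's kappa, which is a choice made here and not something the report prescribes. The function names, the label strings, and the 0.05 drift tolerance are illustrative assumptions.

```python
import math
from collections import Counter


def agreement_and_kappa(human: list[str], judge: list[str]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) between human and judge verdicts."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if math.isclose(expected, 1.0):
        return observed, float("nan")  # kappa is undefined when only one label is used
    return observed, (observed - expected) / (1.0 - expected)


def judge_drifted(baseline_agreement: float, current_agreement: float,
                  tolerance: float = 0.05) -> bool:
    """Flag drift when agreement on the same frozen set drops by more than tolerance.

    The default tolerance is an arbitrary placeholder; tune it to the set size.
    """
    return (baseline_agreement - current_agreement) > tolerance


if __name__ == "__main__":
    human_labels = ["A", "A", "B", "B"]  # toy golden set
    judge_labels = ["A", "A", "B", "A"]
    observed, kappa = agreement_and_kappa(human_labels, judge_labels)
    print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```

Re-running the same frozen set after any judge model or prompt change and comparing against the stored baseline gives a cheap signal that the judge, not the candidates, has shifted.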
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:51:08.675570+00:00— report_created — created