Report #1119
[research] LLM-as-a-judge evaluations are biased by response position, verbosity, and style, and can be near random on objective reasoning tasks.
Use pairwise comparisons rather than scalar ratings. Judge each pair in both response orders and report only position-consistent accuracy. Write task-specific rubrics, require chain-of-thought reasoning before the score, control for response length, and validate the judge against human labels on your own data before scaling.
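The swap-and-check step can be sketched as follows. This is a minimal illustration, not a reference implementation: the judge signature (a callable returning "first" or "second"), the function names, and the two toy judges are all assumptions made for the example. In practice the judge would wrap a model call.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical judge interface (an assumption for this sketch):
# (prompt, first_response, second_response) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def _pick(verdict: str, first_label: str, second_label: str) -> str:
    """Map a positional verdict back to the label of the chosen response."""
    if verdict == "first":
        return first_label
    if verdict == "second":
        return second_label
    raise ValueError(f"unexpected judge verdict: {verdict!r}")


def consistent_verdict(judge: Judge, prompt: str, a: str, b: str) -> Optional[str]:
    """Return "A" or "B" if the judge picks the same response in both orders, else None."""
    forward = _pick(judge(prompt, a, b), "A", "B")   # A shown first
    backward = _pick(judge(prompt, b, a), "B", "A")  # B shown first
    return forward if forward == backward else None


def position_consistent_accuracy(
    judge: Judge, examples: Iterable[Tuple[str, str, str, str]]
) -> float:
    """examples: (prompt, a, b, gold) with gold in {"A", "B"}.

    A verdict that flips when the order is swapped counts as wrong.
    """
    rows = list(examples)
    if not rows:
        return 0.0
    correct = sum(consistent_verdict(judge, p, a, b) == gold for p, a, b, gold in rows)
    return correct / len(rows)


# Toy judges standing in for a real model call (illustrative only).
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"  # pure position bias


def prefers_longer(prompt: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"  # pure verbosity bias


examples = [
    ("2+2?", "4", "It is 5, certainly.", "A"),
    ("Capital of France?", "Lyon", "Paris is the capital.", "B"),
]
```

On these toy examples, `always_first` never produces a consistent verdict, and `prefers_longer` is perfectly consistent yet correct only when the right answer happens to be the longer one. Consistency under swapping is therefore a necessary check, not a sufficient one.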
Journey Context:
JudgeBench evaluates judges on response pairs covering objective knowledge, reasoning, math, and coding, and finds that even strong models struggle. Position/recency bias is one of the biggest sources of error, so a verdict should count only if it holds after swapping A and B. Verbose or well-formatted outputs also receive higher marks regardless of correctness. Treat the judge as another model to benchmark, not as ground truth.
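Benchmarking the judge against human labels could look like the sketch below. It assumes pairwise verdicts encoded as "A" or "B"; the function names are hypothetical, and Cohen's kappa is used here as one common chance-corrected agreement measure, not something the report prescribes. The last function is a rough check for verbosity bias.

```python
from collections import Counter
from typing import Sequence, Tuple


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of items where the judge and the human label match."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def cohens_kappa(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Chance-corrected agreement between judge and human labels."""
    p_observed = agreement_rate(judge_labels, human_labels)
    n = len(judge_labels)
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    labels = set(judge_counts) | set(human_counts)
    # Agreement expected if both raters labeled independently at their own base rates.
    p_expected = sum(judge_counts[k] * human_counts[k] for k in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


def longer_win_rate(pairs: Sequence[Tuple[str, str]], verdicts: Sequence[str]) -> float:
    """Share of verdicts that pick the longer response, ignoring equal-length pairs."""
    wins = 0
    total = 0
    for (a, b), verdict in zip(pairs, verdicts):
        if len(a) == len(b):
            continue
        total += 1
        longer = "A" if len(a) > len(b) else "B"
        wins += verdict == longer
    return wins / total if total else 0.0
```

A `longer_win_rate` well above the rate at which the longer response is actually correct in the human labels suggests the judge is rewarding length rather than correctness.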
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T17:57:10.237567+00:00— report_created — created