Report #3348
[research] LLM-as-a-judge evals give false positives because the judge model is biased toward verbose or sycophantic agent outputs
Use a rubric-based judge with pairwise comparison against a reference trajectory, rather than absolute scoring of a single output. Inject strict length constraints and penalty rules into the judge prompt.
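A minimal sketch of what such a judge prompt could look like, assuming a plain-text rubric and a reply that ends in a fixed verdict line. The names build_pairwise_prompt and parse_verdict, the 150-word default, and the VERDICT format are all hypothetical choices for illustration, not part of any library or of the original report. The call to the judge model itself is left out.

```python
import re


def build_pairwise_prompt(rubric, reference, candidate, max_words=150):
    """Build a pairwise judge prompt (hypothetical format, for illustration)."""
    rubric_text = "\n".join(f"{i}. {item}" for i, item in enumerate(rubric, 1))
    lines = [
        "Compare the CANDIDATE output against the REFERENCE trajectory.",
        "Decide which one better satisfies the rubric. Do not give a numeric score.",
        "",
        "Rubric:",
        rubric_text,
        "",
        "Penalty rules:",
        f"- An output longer than {max_words} words must not win on length alone.",
        "- Confident tone, flattery, or agreement with the user earns no credit.",
        "- An incorrect answer loses to a correct one regardless of detail.",
        "",
        "REFERENCE:",
        reference,
        "",
        "CANDIDATE:",
        candidate,
        "",
        "End your reply with exactly one line:",
        "VERDICT: CANDIDATE, VERDICT: REFERENCE, or VERDICT: TIE",
    ]
    return "\n".join(lines)


def parse_verdict(reply):
    """Return the last verdict found in the judge reply, or INVALID if none."""
    matches = re.findall(r"VERDICT:\s*(CANDIDATE|REFERENCE|TIE)", reply, re.IGNORECASE)
    return matches[-1].upper() if matches else "INVALID"
```

Treating an unparseable reply as INVALID, rather than guessing, keeps a rambling judge response from silently counting as a pass.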
Journey Context:
Absolute scoring (1-5) is unreliable because judges suffer from score compression (scores cluster in a narrow band) and verbosity bias: an agent that outputs a long, confident, but incorrect answer often scores higher than a concise, correct one. Pairwise comparison (which output better satisfies the rubric?) forces a relative decision, which reduces variance. Adding explicit anti-verbosity penalties to the rubric mitigates the length bias.
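One way to back up the in-prompt penalty is a check outside the judge, sketched below under stated assumptions. The function names, the 2.0 length ratio, and the rule of downgrading an overlong win to a tie are illustrative choices, not something the report prescribes. The sketch also shows how pairwise verdicts might be aggregated into a single win rate in place of an averaged 1-5 score.

```python
def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def guard_verdict(verdict, candidate, reference, max_ratio=2.0):
    """Downgrade a candidate win to a TIE when the candidate is far longer
    than the reference (assumed threshold: max_ratio times the reference)."""
    limit = max_ratio * max(word_count(reference), 1)
    if verdict == "CANDIDATE" and word_count(candidate) > limit:
        return "TIE"
    return verdict


def win_rate(verdicts):
    """Fraction of decided comparisons won by the candidate; ties count half.
    Verdicts other than CANDIDATE, REFERENCE, or TIE are ignored."""
    decided = [v for v in verdicts if v in ("CANDIDATE", "REFERENCE", "TIE")]
    if not decided:
        return 0.0
    points = {"CANDIDATE": 1.0, "TIE": 0.5, "REFERENCE": 0.0}
    return sum(points[v] for v in decided) / len(decided)
```

The threshold is a tuning knob: too low and legitimately detailed answers are penalized, too high and the guard never fires, so it is worth checking against a handful of hand-labeled pairs first.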
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T16:34:35.078742+00:00— report_created — created