Report #98968
[counterintuitive] A single LLM judge is sufficient for automatic evaluation
Use multiple judges, human baselines, and bias-aware protocols: shuffle answer order, calibrate rubrics, break evaluations into atomic criteria, and report inter-judge agreement.
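A minimal sketch of the order-shuffling and multi-judge steps, assuming a hypothetical judge interface: a callable that receives a question and two answers and returns "first" or "second". The names `debiased_vote`, `panel_verdict`, and the two toy judges are illustrative only; a real judge would wrap an LLM call.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical judge interface (an assumption, not a real library API):
# takes (question, first_answer, second_answer) and returns "first" or "second".
Judge = Callable[[str, str, str], str]


def debiased_vote(judge: Judge, question: str, a: str, b: str) -> str:
    """Query one judge in both answer orders.

    Returns "A" or "B" when the verdict is stable across orders,
    and "tie" when it flips (a sign of position bias).
    """
    forward = judge(question, a, b)   # A shown first
    backward = judge(question, b, a)  # B shown first
    pick_forward = "A" if forward == "first" else "B"
    pick_backward = "B" if backward == "first" else "A"
    return pick_forward if pick_forward == pick_backward else "tie"


def panel_verdict(judges: List[Judge], question: str, a: str, b: str) -> str:
    """Plurality vote over order-debiased verdicts; "tie" if the top two are level."""
    votes = Counter(debiased_vote(j, question, a, b) for j in judges)
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


# Toy stand-ins for real LLM judges, used only to exercise the logic.
def always_first(question: str, first: str, second: str) -> str:
    return "first"  # pure position bias


def prefers_longer(question: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"
```

With these toy judges, `always_first` yields "tie" because its verdict flips when the order is swapped, so its position bias is neutralized rather than counted as a vote for either answer.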
Journey Context:
Using one LLM to score another is convenient but unreliable on its own. LLM judges exhibit position bias (favoring an answer because of where it appears), length bias (favoring longer answers), self-preference (favoring outputs from their own model family), and sensitivity to prompt wording. A single aggregate score can also hide tradeoffs between helpfulness, safety, and correctness. Reliable automatic evaluation pairs LLM judges with human spot-checks, multiple judge models, and carefully designed rubrics that score each quality dimension separately.
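One way to quantify how well an LLM judge tracks a human spot-check (or a second judge) is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python illustration under the assumption that both raters labeled the same items with categorical labels; the function name `cohen_kappa` and the sample labels are made up for the example.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only (not real evaluation data).
llm_judge = ["pass", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "fail"]
kappa = cohen_kappa(llm_judge, human)
```

Here the raters agree on 3 of 4 items (0.75 observed), chance agreement is 0.5, and kappa comes out to 0.5. Computing this per rubric dimension (for example, separately for correctness and safety) shows where judges diverge instead of averaging the disagreement away.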
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:05:18.136936+00:00— report_created — created