Report #97309
[research] Single-trial LLM-as-a-judge comparisons are too noisy for high-stakes decisions
Run 10-20 repeated trials per comparison with randomized response order and majority voting; pair pairwise verdicts with pointwise scores and report flip rates and ICC. Use 50 trials or a multi-judge panel for borderline items.
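A minimal sketch of the repeated-trial procedure, assuming a hypothetical `judge` callable that takes the two responses in display order and returns "first" or "second". The function name, return keys, and default of 15 trials are illustrative assumptions, not part of the study.

```python
import random
from collections import Counter
from typing import Callable, Optional


def repeated_pairwise(
    judge: Callable[[str, str], str],  # hypothetical: returns "first" or "second"
    resp_a: str,
    resp_b: str,
    trials: int = 15,  # odd count avoids ties; the report suggests 10-20
    seed: Optional[int] = None,
) -> dict:
    """Run the judge `trials` times with randomized order and majority-vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(trials):
        # Randomize which response is shown first to dilute position bias.
        swapped = rng.random() < 0.5
        first, second = (resp_b, resp_a) if swapped else (resp_a, resp_b)
        verdict = judge(first, second)
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
        # Map the positional verdict back to the underlying response.
        picked_first = verdict == "first"
        votes.append("B" if picked_first == swapped else "A")

    counts = Counter(votes)
    majority, majority_count = counts.most_common(1)[0]
    return {
        "majority": majority,
        # Share of trials that disagreed with the majority verdict.
        "flip_rate": 1 - majority_count / trials,
        "votes": dict(counts),
    }
```

A nonzero `flip_rate` on an item is a signal to escalate it to more trials or to a multi-judge panel, as recommended above.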
Journey Context:
The Coin Flip Judge study, with 50 trials per question across 29 tasks, finds a mean pairwise flip rate of 13.6%, with 28% of questions exceeding 20% and one reaching 56%. Cross-judge agreement is only 76% (kappa = 0.51), and 44.7% of pointwise score variance is within-question noise. Single-trial fidelity is 86.6%; 11 trials reach 95% consensus. Position bias, prompt wording (25% outcome flips), and API nondeterminism all contribute, so single-trial judge leaderboards can reverse close rankings.
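For the pointwise side, one way to quantify within-question noise is an intraclass correlation over repeated scores. The sketch below assumes the one-way random-effects form, ICC(1) = (MSB - MSW) / (MSB + (k - 1) * MSW); the report does not say which ICC variant the study used, and the function name and input layout (one row per question, one column per repeated trial) are illustrative.

```python
from statistics import mean
from typing import List


def icc_oneway(scores: List[List[float]]) -> float:
    """One-way random-effects ICC(1) for an n-questions x k-trials score table."""
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 questions with the same number (>= 2) of trials")

    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]

    # Between-question mean square: how far question means sit from the grand mean.
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    # Within-question mean square: trial-to-trial noise on the same question.
    msw = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))

    denom = msb + (k - 1) * msw
    return (msb - msw) / denom if denom else float("nan")
```

Values near 1 mean repeated scores for the same question agree closely; values near or below 0 mean trial-to-trial noise dominates the differences between questions.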
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T04:53:55.853194+00:00 - report_created - created