Report #70047
[research] LLM-as-a-judge evals are biased and agree with themselves
Calibrate LLM judges against human-annotated golden labels and use a different, stronger model for judging than the agent model.
Journey Context:
A model that judges its own outputs shows self-preference bias, and a weaker model judging a stronger one produces poor evaluations. Use a calibrated judge built on a different model (e.g., Claude 3.5 Sonnet judging GPT-4o), and measure its inter-rater reliability (Cohen's kappa) against human annotations to confirm that the judge is actually accurate and not just confidently wrong.
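As a rough illustration of the calibration check, the sketch below computes Cohen's kappa between human golden labels and judge verdicts on the same items. The function name `cohen_kappa`, the "pass"/"fail" label set, and the sample data are all assumptions made for this example, not part of the original report.

```python
from collections import Counter


def cohen_kappa(human_labels, judge_labels):
    """Cohen's kappa between two raters over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    label frequencies.
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("both raters must label the same items")
    n = len(human_labels)
    if n == 0:
        raise ValueError("need at least one labeled item")

    # Observed agreement: fraction of items where both raters match.
    p_o = sum(h == j for h, j in zip(human_labels, judge_labels)) / n

    # Chance agreement: product of each rater's marginal label rates.
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    all_labels = set(human_counts) | set(judge_counts)
    p_e = sum((human_counts[c] / n) * (judge_counts[c] / n) for c in all_labels)

    if p_e == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical golden set: human annotations vs. LLM judge verdicts.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail", "fail", "pass"]

kappa = cohen_kappa(human, judge)
print(f"Cohen's kappa: {kappa:.3f}")
```

In this made-up sample the raters agree on 8 of 10 items, but kappa is noticeably lower than 0.8 because part of that agreement is expected by chance. What kappa threshold counts as "calibrated" is a judgment call for the team and is not specified in this report.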
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:09:09.044818+00:00— report_created — created