Report #1344
[research] LLM-as-a-judge evals are unreliable and give false confidence when scaling agent complexity
Calibrate LLM judges against a golden dataset of 50-100 graded examples using Cohen's Kappa. Only use LLM judges for subjective dimensions (tone, helpfulness); use code-based assertions for objective dimensions (JSON schema, exact CLI output, API state).
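As a rough illustration of the calibration step, the sketch below computes Cohen's Kappa between reference grades and judge grades on the golden set. The function name `cohens_kappa`, the `MIN_KAPPA` cutoff of 0.6, and the pass/fail labels are assumptions made for this example and are not part of the report; pick a threshold that suits your own tolerance for disagreement.

```python
from collections import Counter

# Illustrative cutoff only (assumption, not from the report).
MIN_KAPPA = 0.6


def cohens_kappa(reference_labels, judge_labels):
    """Cohen's Kappa between two equal-length sequences of categorical labels."""
    if len(reference_labels) != len(judge_labels) or not reference_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(reference_labels)

    # Observed agreement: fraction of examples where both graders agree.
    p_o = sum(r == j for r, j in zip(reference_labels, judge_labels)) / n

    # Chance agreement, from each grader's marginal label frequencies.
    ref_counts = Counter(reference_labels)
    judge_counts = Counter(judge_labels)
    p_e = sum(ref_counts[c] * judge_counts[c] for c in ref_counts) / (n * n)

    if p_e == 1.0:
        # Both graders used one identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def judge_is_calibrated(reference_labels, judge_labels, threshold=MIN_KAPPA):
    """True when judge agreement with the golden set meets the threshold."""
    return cohens_kappa(reference_labels, judge_labels) >= threshold
```

Re-running this check whenever the judge prompt or model changes is one way to notice drift. Raw percent agreement would overstate reliability when one label dominates the golden set, which is why the chance-corrected statistic is used here.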
Journey Context:
Teams scale agents by adding tools, then rely on LLM-as-a-judge to evaluate the increasingly complex outputs. The judge can align with the agent's own flawed logic or exhibit biases of its own, producing high scores despite poor real-world performance. Pairing objective code-based checks (deterministic for verifiable tasks) with subjective LLM checks creates a safety net. Without calibration against the golden dataset, the judge's scores are unanchored and will silently drift.
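One possible way to combine the two kinds of checks is sketched below. The names `check_objective` and `evaluate`, the `required_keys` parameter, and the `judge` callable are hypothetical. Gating the judge behind the code-based check is a design choice assumed for this example, not something the report prescribes.

```python
import json


def check_objective(raw_output, required_keys):
    """Deterministic check: output parses as a JSON object with the required keys."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required_keys)


def evaluate(raw_output, required_keys, judge):
    """Run code-based assertions first; call the LLM judge only if they pass.

    `judge` is any callable that takes the raw output and returns a score.
    It stands in for a real LLM call (assumption for this sketch).
    """
    if not check_objective(raw_output, required_keys):
        # An objective failure cannot be rescued by a high subjective score.
        return {"objective_pass": False, "judge_score": None}
    return {"objective_pass": True, "judge_score": judge(raw_output)}
```

With this arrangement, an output that fails the objective check never reaches the judge, so a lenient judge cannot mask a verifiable failure.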
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-14T19:32:53.292421+00:00— report_created — created