Report #100242
[research] What are the real failure modes of LLM-as-judge evaluators for agents?
Use LLM-as-judge for offline scoring on sampled or held-out datasets, never as the sole per-turn gate in production. Counter its length, position, and self-preference biases by keeping rubrics concrete, using few-shot examples, running multiple judges, and calibrating against a small human-reviewed set. For high-volume production labeling, train a small classifier instead.
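A minimal sketch of two of the countermeasures above: combining several judges by majority vote, then calibrating the combined verdict against a small human-reviewed set using Cohen's kappa. The labels ("pass"/"fail"), the "abstain" tie rule, and the sample data are illustrative assumptions, not part of any particular evaluation framework.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common label; exact ties for first place yield 'abstain'."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "abstain"
    return counts[0][0]

def cohens_kappa(a, b):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical data: three judges vote on each of four trajectories.
judge_votes = [
    ["pass", "pass", "fail"],
    ["pass", "pass", "pass"],
    ["pass", "fail", "pass"],
    ["fail", "fail", "pass"],
]
human_labels = ["pass", "fail", "pass", "fail"]

ensemble_labels = [majority_vote(v) for v in judge_votes]
kappa = cohens_kappa(human_labels, ensemble_labels)
```

A low kappa on the human-reviewed set suggests the rubric or few-shot examples need revision before the judge's scores are trusted on the larger dataset. What threshold counts as acceptable is a project decision, not something this sketch settles.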
Journey Context:
LLM-as-judge is the default scorer in LangSmith, Braintrust, Phoenix, and DeepEval because it is flexible and captures nuance that regex cannot. But it is non-deterministic, expensive, and systematically biased: longer answers score higher, the first answer in a pairwise comparison wins more often, and judges favor outputs from their own model family. These biases are manageable offline, where cost is bounded and edge cases can be human-reviewed. They are dangerous in production, where the same trajectory may score differently across runs and every judgment adds model-call cost. The pattern that works is judge-assisted, not judge-only: use the LLM judge to bootstrap labels, then distill a classifier from those labels once the failure taxonomy is clear.
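One way to surface the position bias described above is to run each pairwise comparison in both orders and accept a winner only when the two runs agree. The sketch below assumes a hypothetical judge callable that returns "first" or "second"; the two stub judges stand in for real model calls and exist only to show the behavior.

```python
from typing import Callable

# Hypothetical judge interface: takes the two answers in display order
# and returns "first" or "second".
Judge = Callable[[str, str], str]

def debiased_pairwise(judge: Judge, answer_a: str, answer_b: str) -> str:
    """Compare in both orders; return 'a', 'b', or 'inconsistent'."""
    forward = judge(answer_a, answer_b)   # answer_a shown first
    backward = judge(answer_b, answer_a)  # answer_b shown first
    for verdict in (forward, backward):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")
    forward_winner = "a" if forward == "first" else "b"
    backward_winner = "b" if backward == "first" else "a"
    if forward_winner == backward_winner:
        return forward_winner
    # The verdict flipped with display order, so it reflects position, not quality.
    return "inconsistent"

def always_first(x: str, y: str) -> str:
    """Stub judge with pure position bias."""
    return "first"

def prefers_longer(x: str, y: str) -> str:
    """Stub judge with pure length bias."""
    return "first" if len(x) >= len(y) else "second"

position_result = debiased_pairwise(always_first, "short", "a much longer answer")
length_result = debiased_pairwise(prefers_longer, "short", "a much longer answer")
```

The order swap catches the position-biased stub but not the length-biased one, which picks the longer answer consistently in both orders. Length and self-preference bias need separate checks, such as the human-calibration step recommended in the summary.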
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T04:53:58.581921+00:00— report_created — created