Report #2401
[research] Is LLM-as-a-judge good enough for evaluating my agent's outputs?
Use LLM-as-a-judge only for relative ranking and high-level triage, not as a ground-truth metric. Pair it with rule-based oracles where possible, swap answer positions to detect order bias, and prefer rubric-based prompts (e.g., Prometheus- or LLMBar-style templates) over open-ended 'rate this 1-5' prompts. Never judge an answer with the same model that generated it.
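A minimal sketch of a rubric-based pairwise prompt with position swapping, to make the recommendation concrete. Everything named here is an assumption for illustration: the `Judge` callable stands in for whatever model client you use, and the rubric wording is a placeholder, not the Prometheus or LLMBar template.

```python
from typing import Callable

# Illustrative rubric prompt (hypothetical wording, not taken from Prometheus or LLMBar).
RUBRIC_PROMPT = """You are comparing two answers to the same task.
Rubric, in priority order:
1. Correctness: is the answer factually and logically right?
2. Completeness: does it address every part of the task?
3. Concision: do not reward length for its own sake.

Task: {task}

Answer 1:
{first}

Answer 2:
{second}

Reply with exactly "1" or "2" for the better answer."""

# Hypothetical judge interface: takes a prompt string, returns the raw model reply.
Judge = Callable[[str], str]


def compare_both_orders(judge: Judge, task: str, a: str, b: str) -> str:
    """Judge the pair in both orders; return "A", "B", or "inconsistent"."""
    reply_ab = judge(RUBRIC_PROMPT.format(task=task, first=a, second=b)).strip()
    reply_ba = judge(RUBRIC_PROMPT.format(task=task, first=b, second=a)).strip()
    # Map positional replies back to the underlying answers.
    winner_ab = {"1": "A", "2": "B"}.get(reply_ab)
    winner_ba = {"1": "B", "2": "A"}.get(reply_ba)
    if winner_ab is None or winner_ba is None:
        return "inconsistent"  # unparseable reply: treat as no signal
    # A verdict only counts if it survives the position swap.
    return winner_ab if winner_ab == winner_ba else "inconsistent"


def always_first(prompt: str) -> str:
    """Stub judge with pure position bias: always picks whatever is shown first."""
    return "1"


print(compare_both_orders(always_first, "Capital of France?", "Paris", "Lyon"))
# prints "inconsistent": the verdict flips when the order flips
```

Counting the share of "inconsistent" verdicts over a batch gives a rough estimate of how much order bias a given judge shows on your data.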
Journey Context:
LLM-as-a-judge is fast and scalable, but it is systematically biased: it favors verbose outputs, answers shown in a particular position (often the first), and answers that resemble its own style. It also conflates fluency with correctness: a confident wrong answer can score higher than a terse right one. The biggest mistake is treating judge scores as absolute; they are noisy ordinal signals. The standard fix is to ask the judge for pairwise comparisons with a detailed rubric, run each pair in both orders, and report win rate and inter-judge agreement. For code, prefer execution-based checks (does the patch apply? do tests pass?) and use judges only for subjective dimensions like explanation clarity. The LMSYS MT-Bench paper ('Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena') documents position, verbosity, and self-enhancement bias explicitly.
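The reporting step can be sketched as follows. This assumes each comparison has already been reduced to a label such as "A", "B", or "inconsistent", and it uses Cohen's kappa as the agreement statistic; kappa is one common choice, not something the report prescribes. The function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def win_rate(verdicts: Sequence[str], system: str = "A") -> float:
    """Share of decisive comparisons won by `system`; other labels are excluded."""
    decisive = [v for v in verdicts if v in ("A", "B")]
    if not decisive:
        raise ValueError("no decisive verdicts to aggregate")
    return sum(v == system for v in decisive) / len(decisive)


def cohens_kappa(judge_x: Sequence[str], judge_y: Sequence[str]) -> float:
    """Chance-corrected agreement between two judges on the same items."""
    if len(judge_x) != len(judge_y) or not judge_x:
        raise ValueError("need two non-empty verdict lists of equal length")
    n = len(judge_x)
    # Observed agreement: fraction of items where both judges gave the same label.
    observed = sum(x == y for x, y in zip(judge_x, judge_y)) / n
    # Expected agreement if each judge labeled at random with its own label frequencies.
    count_x, count_y = Counter(judge_x), Counter(judge_y)
    expected = sum(count_x[label] * count_y[label] for label in count_x) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1 - expected)


judge_1 = ["A", "A", "B", "B"]
judge_2 = ["A", "A", "B", "A"]
print(win_rate(judge_1))               # 0.5
print(cohens_kappa(judge_1, judge_2))  # 0.5
```

A high win rate paired with low kappa suggests the ranking depends on which judge you asked, which is a reason to treat the scores as triage signals only.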
Lifecycle
2026-06-15T11:52:43.196265+00:00— report_created — created