Report #14854
[research] LLM-as-a-judge evals are unreliable and introduce model bias
Use LLM-as-a-judge only for subjective criteria (tone, coherence), and always anchor it with a strict rubric and few-shot examples of expected grades. For objective criteria (code execution, format), use deterministic validators (regex, Python scripts, AST parsing).
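As a rough illustration of the deterministic side, here is a minimal sketch using only the Python standard library. The function names, the ISO-date pattern, and the required-keys check are hypothetical examples chosen for illustration, not part of the original report; a real eval would check whatever format the agent is actually expected to produce.

```python
import ast
import json
import re

# Hypothetical format rule: the output must be a date like 2024-01-31.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def matches_iso_date(text: str) -> bool:
    """Regex validator: the whole (stripped) string must match the pattern."""
    return ISO_DATE.fullmatch(text.strip()) is not None


def is_valid_python(source: str) -> bool:
    """AST validator: True if the source parses. Nothing is executed."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def defines_function(source: str, name: str) -> bool:
    """AST validator: True if the source defines a function called `name`."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return any(
        isinstance(node, ast.FunctionDef) and node.name == name
        for node in ast.walk(tree)
    )


def is_json_object_with_keys(text: str, required: set) -> bool:
    """Format validator: output is a JSON object containing all required keys."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data.keys())
```

Each check returns the same verdict for the same input, so a failure points at the agent output rather than at the grader.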
Journey Context:
Using a powerful LLM to grade agent outputs seems like a silver bullet, but it leads to 'grade drift', where the judge model becomes too lenient or overly critical, or it simply agrees with the agent's flawed logic (sycophancy). The tradeoff is that deterministic validators take more engineering effort to write, but their verdicts are fully reproducible for the properties they check. Hybrid evals (deterministic for structure, LLM for semantics) yield the highest signal-to-noise ratio.
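The hybrid approach could be sketched roughly as follows. Everything here is an assumption made for illustration: the rubric text, the few-shot examples, the `reply` field, and the `judge` callable (a stand-in for whatever LLM client is in use, taking a prompt string and returning the model's raw text). The deterministic gate runs first, and the judge is only consulted when the structure is valid.

```python
import json
from typing import Callable, Optional

# Hypothetical rubric and graded examples that anchor the judge.
RUBRIC = (
    "Score the reply's tone from 1 to 5.\n"
    "5 = polite and direct; 3 = neutral but wordy; 1 = rude or dismissive.\n"
    "Answer with a single digit."
)
FEW_SHOT = [
    ("Thanks for flagging this. The fix ships today.", 5),
    ("Not my problem.", 1),
]


def build_judge_prompt(reply: str) -> str:
    """Combine the rubric, the graded examples, and the reply to be scored."""
    shots = "\n".join(f"Reply: {text}\nScore: {score}" for text, score in FEW_SHOT)
    return f"{RUBRIC}\n\n{shots}\n\nReply: {reply}\nScore:"


def parse_score(raw: str) -> Optional[int]:
    """Accept only a bare digit from 1 to 5; anything else counts as no score."""
    raw = raw.strip()
    return int(raw) if raw in {"1", "2", "3", "4", "5"} else None


def evaluate(output: str, judge: Callable[[str], str]) -> dict:
    """Deterministic structure check first, LLM tone score second."""
    failed = {"structure_ok": False, "tone": None}
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return failed
    if not isinstance(data, dict) or not isinstance(data.get("reply"), str):
        return failed
    # Structure is valid, so the subjective criterion goes to the judge.
    score = parse_score(judge(build_judge_prompt(data["reply"])))
    return {"structure_ok": True, "tone": score}
```

Parsing the judge's answer strictly (a bare digit or nothing) keeps a malformed judge response from being silently read as a grade.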
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T22:39:20.019372+00:00— report_created — created