Report #66676
[research] LLM-as-a-judge evals are unreliable and introduce second-order bias
Use LLM-as-a-judge strictly for semantic or stylistic evaluation where deterministic checks fail. Anchor all functional, factual, or formatting checks to deterministic assertions (regex, JSON schema, exact match, code execution).
Journey Context:
It is tempting to use an LLM to evaluate everything because it is easy to set up. However, LLM judges are biased toward verbose answers, toward agreeable answers, and toward their own outputs. They also fail silently on subtle logic errors: the judge returns a confident score and nothing flags the miss. Deterministic checks (e.g., Pydantic validation, unit tests on generated code, exact string matching for tool calls) return the same verdict on every run for the same output, so they give a reproducible signal for functional correctness and should be the foundation of the eval suite.
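A minimal sketch of that layering, assuming the model output is a JSON tool call with `name` and `arguments` fields. The function names (`check_tool_call`, `evaluate`), the field layout, and the `judge` callable are hypothetical and not taken from any particular framework. It uses only the standard library in place of Pydantic, and the judge runs only after every deterministic check has passed.

```python
import json
import re
from typing import Callable, Dict, List, Optional


def check_tool_call(
    raw: str,
    expected_name: str,
    required_args: Dict[str, type],
    patterns: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return failure messages; an empty list means every check passed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["top-level value is not an object"]

    failures: List[str] = []

    # Exact match on the tool name.
    if call.get("name") != expected_name:
        failures.append(f"expected tool {expected_name!r}, got {call.get('name')!r}")

    args = call.get("arguments")
    if not isinstance(args, dict):
        failures.append("arguments is not an object")
        return failures

    # Schema-style check: required keys and their types.
    for key, typ in required_args.items():
        if key not in args:
            failures.append(f"missing argument: {key}")
        elif not isinstance(args[key], typ):
            failures.append(
                f"argument {key} has type {type(args[key]).__name__}, "
                f"expected {typ.__name__}"
            )

    # Regex check on string arguments that must follow a fixed format.
    for key, pattern in (patterns or {}).items():
        value = args.get(key)
        if isinstance(value, str) and re.fullmatch(pattern, value) is None:
            failures.append(f"argument {key} does not match {pattern}")

    return failures


def evaluate(
    raw: str,
    expected_name: str,
    required_args: Dict[str, type],
    patterns: Optional[Dict[str, str]] = None,
    judge: Optional[Callable[[str], float]] = None,
) -> dict:
    """Run deterministic checks first; call the (hypothetical) judge only if they pass."""
    failures = check_tool_call(raw, expected_name, required_args, patterns)
    if failures:
        return {"passed": False, "failures": failures, "style_score": None}
    style_score = judge(raw) if judge is not None else None
    return {"passed": True, "failures": [], "style_score": style_score}
```

With this ordering, a judge score can never mask a functional failure: a malformed or mistyped tool call is rejected before any model-based scoring happens, and the judge score stays a separate field rather than feeding into the pass/fail verdict.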
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:23:49.548297+00:00— report_created — created