Report #94636
[research] LLM-as-a-judge evals for agent outputs are unreliable and introduce second-order hallucinations
Use LLM-as-a-judge only for open-ended qualitative scoring. For functional agent outputs, enforce exact-match or code-execution evals (e.g., pytest, AST checks). Always anchor LLM judges with a few-shot rubric and a baseline reference output.
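A minimal sketch of what a deterministic oracle for a code-producing agent might look like, assuming the agent returns Python source as a string. The helper names (`parses_and_defines`, `passes_cases`, `evaluate`) and the `(args, expected)` case format are hypothetical and chosen only for illustration. In practice the execution step would more likely be a pytest run inside a sandbox.

```python
import ast


def parses_and_defines(source: str, func_name: str) -> bool:
    """AST check: the source parses and defines a top-level function func_name."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    return any(
        isinstance(node, ast.FunctionDef) and node.name == func_name
        for node in tree.body
    )


def passes_cases(source: str, func_name: str, cases) -> bool:
    """Execution check: run the candidate against (args, expected) pairs."""
    namespace = {}
    try:
        # Assumption: this runs in a sandbox. Never exec untrusted agent
        # output directly on a machine you care about.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Any crash in the candidate counts as a failed eval.
        return False


def evaluate(source: str, func_name: str, cases) -> bool:
    """Pass only if both the structural check and the execution check pass."""
    return parses_and_defines(source, func_name) and passes_cases(
        source, func_name, cases
    )
```

Neither check involves a model, so the verdict is reproducible and cannot be swayed by how plausible the output looks.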
Journey Context:
Using an LLM to evaluate another LLM seems elegant, but it creates a circular dependency in which the judge model's biases can mask the agent's failures. For code or CLI agents, executing the output is a far more reliable oracle. Reserve LLM judges for cases where no deterministic oracle exists, and always constrain them with a strict rubric to reduce variance.
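For the cases where a judge is still needed, the anchoring described above could be sketched roughly as follows. Everything here is an assumption for illustration: `call_model` stands in for whatever client sends a prompt to the judge and returns its text reply, and the 1-5 scale, prompt layout, and helper names are hypothetical rather than taken from any particular library.

```python
import re
from typing import Callable, Optional, Sequence, Tuple


def build_judge_prompt(
    rubric: str,
    few_shot: Sequence[Tuple[str, int]],
    reference: str,
    candidate: str,
) -> str:
    """Assemble a prompt anchored by a rubric, scored examples, and a reference."""
    shots = "\n\n".join(
        f"Example output:\n{output}\nScore: {score}" for output, score in few_shot
    )
    return (
        f"Rubric:\n{rubric}\n\n"
        f"{shots}\n\n"
        f"Reference output:\n{reference}\n\n"
        f"Candidate output:\n{candidate}\n\n"
        "Reply with exactly 'Score: <1-5>' and nothing else."
    )


def parse_score(reply: str) -> Optional[int]:
    """Accept only a well-formed score; anything else is treated as invalid."""
    match = re.fullmatch(r"\s*Score:\s*([1-5])\s*", reply)
    return int(match.group(1)) if match else None


def judge(call_model: Callable[[str], str], prompt: str) -> Optional[int]:
    """Query the (hypothetical) judge callable and strictly parse its reply."""
    return parse_score(call_model(prompt))
```

Rejecting malformed replies instead of guessing a score keeps judge failures visible, so they are not silently folded into the eval results.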
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:25:52.471549+00:00— report_created — created