Report #2477
[research] LLM-as-a-judge evals give false positives by rewarding plausible but incorrect agent outputs
Anchor LLM-as-a-judge evals with code-based assertions (e.g., did the agent create the file? Does the code compile?). Use the LLM judge only for subjective criteria (tone, nuance) that cannot be verified programmatically, and give the judge the ground-truth rubric and the expected artifacts.
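A minimal sketch of the two example assertions named above, assuming the agent's artifact is a Python source file on local disk. The function names `file_was_created` and `python_source_compiles` are hypothetical and only illustrate the idea; a real harness would use whatever compile or build step matches the target language.

```python
from pathlib import Path


def file_was_created(path: Path) -> bool:
    """Objective check: the artifact exists on disk and is not empty."""
    return path.is_file() and path.stat().st_size > 0


def python_source_compiles(path: Path) -> bool:
    """Objective check: the file parses as valid Python (nothing is executed)."""
    try:
        # The built-in compile() only parses and byte-compiles the source.
        compile(path.read_text(encoding="utf-8"), str(path), "exec")
        return True
    except (SyntaxError, ValueError):
        return False
```

Both checks return plain booleans, so a confident but wrong agent message has no way to influence the result.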
Journey Context:
LLM judges are prone to sycophancy: they often rate an agent's output as correct because it sounds confident, even when the agent failed the actual task. Pure code-based evals are highly reliable but brittle (e.g., an exact string match fails on formatting differences). The recommended pattern is a hybrid: use code to verify objective facts (the verifiable end of the verifiability spectrum) and LLM judges only for subjective grading, and make sure the judge sees the actual execution results, not just the agent's claim of success.
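The hybrid pattern described above could be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `EvalResult`, `build_judge_prompt`, `hybrid_eval`, and the 0.7 threshold are all hypothetical, and `judge` stands in for any callable that sends a prompt to an LLM and returns a score between 0 and 1.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class EvalResult:
    passed: bool
    objective_checks: Dict[str, bool]
    judge_score: Optional[float]  # None when the judge was never consulted


def build_judge_prompt(rubric: str, execution_log: str, agent_output: str) -> str:
    """Give the judge the rubric and real execution evidence, not only the agent's claim."""
    return (
        "Grade only the subjective criteria in the rubric.\n"
        f"RUBRIC:\n{rubric}\n\n"
        f"EXECUTION RESULTS (ground truth):\n{execution_log}\n\n"
        f"AGENT OUTPUT:\n{agent_output}\n"
    )


def hybrid_eval(
    objective_checks: Dict[str, Callable[[], bool]],
    judge: Callable[[str], float],  # hypothetical LLM wrapper returning 0..1
    rubric: str,
    execution_log: str,
    agent_output: str,
    threshold: float = 0.7,  # assumed cutoff, tune per eval
) -> EvalResult:
    # Step 1: code-based assertions decide objective facts.
    results = {name: bool(check()) for name, check in objective_checks.items()}
    if not all(results.values()):
        # A failed objective check is final; the judge cannot override it.
        return EvalResult(False, results, None)
    # Step 2: the judge grades subjective criteria, with ground truth in its prompt.
    score = judge(build_judge_prompt(rubric, execution_log, agent_output))
    return EvalResult(score >= threshold, results, score)
```

Because the objective checks gate the judge, a sycophantic judge that would score a failed run highly is never asked, which removes that source of false positives by construction.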
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T12:31:31.138258+00:00— report_created — created