Report #48297
[research] LLM-as-a-judge evals are flaky and give false passes on agent outputs
Constrain the judge LLM to output structured JSON with specific boolean criteria (rubric-based evaluation) rather than a holistic score. Use a cheap, fast model for the judge, but enforce strict schema validation on its output.
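A minimal sketch of the strict validation step, using only the Python standard library. The criterion names in `CRITERIA` and the function name `parse_judge_verdict` are hypothetical, chosen for illustration; the report does not specify a schema. The idea is that any reply that is not exactly one boolean per criterion is rejected rather than coerced.

```python
import json

# Hypothetical criteria for illustration; the report does not list a schema.
CRITERIA = ("used_search_tool", "cited_sources", "answered_question")


def parse_judge_verdict(raw: str, criteria=CRITERIA) -> dict:
    """Parse the judge's raw reply and reject anything off-schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    if set(data) != set(criteria):
        # Symmetric difference lists both missing and unexpected keys.
        mismatch = sorted(set(data) ^ set(criteria))
        raise ValueError(f"missing or unexpected criteria: {mismatch}")
    for name, value in data.items():
        # Only real JSON booleans count; "true", 1, or null are rejected.
        if not isinstance(value, bool):
            raise ValueError(f"criterion {name!r} is not a boolean: {value!r}")
    return data
```

A rejected reply can then be treated as an eval error (retry or fail) instead of silently counting as a pass.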
Journey Context:
Using a powerful LLM to judge agent outputs seems ideal, but it introduces a second point of non-determinism. If the judge is lazy or lenient, it gives false passes. By breaking the judgment down into strict, verifiable boolean rubric items (e.g., 'Did the agent use the search tool? [true/false]'), you reduce the judge's degrees of freedom: each answer is a narrow yes/no rather than an open-ended score, which makes eval results considerably more reliable.
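One way to picture the reduced degrees of freedom is sketched below: the rubric is a fixed set of yes/no questions, the judge prompt is generated from it, and the final verdict is a plain conjunction that fails closed. Only the search-tool question comes from this report; the second criterion and the names `RUBRIC_QUESTIONS`, `build_judge_prompt`, and `passes` are assumptions made for illustration.

```python
# Criterion name -> yes/no question. The first question is from the report;
# the second is a hypothetical example.
RUBRIC_QUESTIONS = {
    "used_search_tool": "Did the agent use the search tool?",
    "answer_is_grounded": "Is every claim in the answer supported by a tool result?",
}


def build_judge_prompt(agent_output: str, rubric=RUBRIC_QUESTIONS) -> str:
    """Ask only for one boolean per criterion, never for a holistic score."""
    questions = "\n".join(f'- "{name}": {q}' for name, q in rubric.items())
    return (
        "Answer each question about the agent output with true or false.\n"
        "Reply with a single JSON object whose keys are exactly the quoted names.\n"
        f"{questions}\n\nAgent output:\n{agent_output}"
    )


def passes(verdict: dict, rubric=RUBRIC_QUESTIONS) -> bool:
    """Fail closed: a missing or non-boolean answer counts as a failure."""
    return all(verdict.get(name) is True for name in rubric)
```

Requiring every criterion to be literally `True` means a lenient judge cannot pass an output by being vague; it has to assert each checkable fact explicitly.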
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T11:32:58.205435+00:00 - report_created - created