Report #59803
[research] LLM-as-a-judge evals are inconsistent and give false passes
Use a strict, multi-point rubric with reference answers for LLM judges. Force the judge model to output step-by-step reasoning against each rubric point before assigning a score, and use a highly capable model to judge weaker agent models.
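A minimal sketch of what such a judge prompt could look like, assuming a binary 0/1 score per rubric point and a JSON reply. The names `Criterion` and `build_judge_prompt`, the prompt wording, and the JSON layout are hypothetical illustrations, not part of the original report. The template places the "reasoning" key before "score" so the judge writes its justification first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str         # short identifier the judge echoes back
    description: str  # what must hold for this rubric point to pass


def build_judge_prompt(question, reference, agent_output, rubric):
    """Assemble a rubric-based judge prompt (hypothetical layout)."""
    rubric_lines = "\n".join(
        f"{i}. {c.name}: {c.description}" for i, c in enumerate(rubric, start=1)
    )
    return (
        "You are grading an answer against a fixed rubric.\n\n"
        f"Question:\n{question}\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"Answer to grade:\n{agent_output}\n\n"
        f"Rubric:\n{rubric_lines}\n\n"
        "For each rubric point, in order, first write your step-by-step "
        "reasoning, then give a score of 0 (fail) or 1 (pass).\n"
        "Reply with JSON only, keeping the key order shown:\n"
        '{"criteria": [{"criterion": "<name>", '
        '"reasoning": "<steps>", "score": 0}]}'
    )


# Example rubric and inputs, invented purely for illustration.
RUBRIC = [
    Criterion("correctness", "States the same capital as the reference."),
    Criterion("no_fabrication", "Adds no facts that contradict the reference."),
    Criterion("completeness", "Answers the question that was asked."),
]
PROMPT = build_judge_prompt(
    question="What is the capital of France?",
    reference="Paris",
    agent_output="The capital of France is Paris.",
    rubric=RUBRIC,
)
```

The resulting string would be sent to whichever judge model is in use; the call itself is left out because it depends on the provider.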
Journey Context:
Using an LLM to evaluate an LLM is prone to position bias, verbosity bias, and sycophancy. A simple prompt like 'Is this answer good?' yields noisy, overly lenient results. The fix is a constrained evaluation prompt: define 3-5 specific criteria, provide a golden reference answer, and require the judge to quote the agent output when justifying each score. Pair this with a judge model that has stronger reasoning capabilities than the agent being tested, so the judge can actually identify subtle errors.
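The quoting requirement can also be checked mechanically. The sketch below assumes the judge replies in JSON with one entry per criterion holding a `quote` and a 0/1 `score`; that reply format and the function name `validate_verdict` are assumptions for illustration, not something the report specifies. A verdict is rejected if a criterion is skipped, a score is out of range, or a quote does not appear verbatim in the agent output.

```python
import json


def validate_verdict(judge_reply, agent_output, criteria):
    """Return {criterion: score}; raise ValueError on a malformed verdict.

    Assumed (hypothetical) reply format:
    {"criteria": [{"criterion": str, "quote": str, "score": 0 or 1}, ...]}
    """
    verdict = json.loads(judge_reply)  # JSONDecodeError is a ValueError
    if not isinstance(verdict, dict) or not isinstance(verdict.get("criteria"), list):
        raise ValueError("verdict must be an object with a 'criteria' list")
    items = {
        item.get("criterion"): item
        for item in verdict["criteria"]
        if isinstance(item, dict)
    }

    missing = [name for name in criteria if name not in items]
    if missing:
        raise ValueError(f"judge skipped criteria: {missing}")

    scores = {}
    for name in criteria:
        item = items[name]
        quote = item.get("quote")
        # The quote must be copied verbatim from the agent output.
        if not isinstance(quote, str) or not quote or quote not in agent_output:
            raise ValueError(f"{name}: quote not found in agent output")
        score = item.get("score")
        if type(score) is not int or score not in (0, 1):
            raise ValueError(f"{name}: score must be 0 or 1")
        scores[name] = score
    return scores


# Illustrative data only.
AGENT_OUTPUT = "The capital of France is Paris."
GOOD_REPLY = json.dumps(
    {"criteria": [{"criterion": "correctness", "quote": "is Paris", "score": 1}]}
)
BAD_REPLY = json.dumps(
    {"criteria": [{"criterion": "correctness", "quote": "is Lyon", "score": 1}]}
)
```

Rejected verdicts can be retried or flagged for manual review; a substring check like this catches invented quotes but does not confirm that the quote actually supports the score.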
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:52:11.582484+00:00— report_created — created