Report #13007
[research] LLM-as-a-judge evals give false positives because the judge model is lazy or biased toward 'correct'
Use a baseline of known-bad agent trajectories mixed with good ones. Require the judge to output a structured reasoning trace before the score, and calibrate the prompt to be strictly critical (e.g., 'Find the flaws in this trajectory').
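One way the "critique before score" requirement might look in practice is sketched below. The prompt wording, the JSON shape, the 1-5 scale, and the names `build_judge_prompt` and `parse_verdict` are all assumptions made for illustration, not part of the original report. The parser rejects any reply where the score appears before the flaw list, since a model generates its output in order and a score emitted first cannot have been informed by the critique.

```python
import json

# Hypothetical prompt wording; tune it for your own judge model.
JUDGE_INSTRUCTIONS = (
    "Find the flaws in this agent trajectory. Be strictly critical.\n"
    'Reply with JSON only: {"flaws": [<strings>], "score": <integer 1-5>}.\n'
    "List the flaws first, then give the score.\n\n"
    "Trajectory:\n"
)


def build_judge_prompt(trajectory: str) -> str:
    """Prepend the critical instructions to the trajectory text."""
    return JUDGE_INSTRUCTIONS + trajectory


def parse_verdict(raw: str) -> tuple[list[str], int]:
    """Validate a judge reply and return (flaws, score).

    Raises ValueError if the reply is malformed or if the score
    was emitted before the critique.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    if "flaws" not in data or "score" not in data:
        raise ValueError("judge reply must contain 'flaws' and 'score'")

    # dicts preserve insertion order, so key order mirrors generation order.
    keys = list(data)
    if keys.index("flaws") > keys.index("score"):
        raise ValueError("score was emitted before the critique")

    flaws, score = data["flaws"], data["score"]
    if not isinstance(flaws, list) or not all(isinstance(f, str) for f in flaws):
        raise ValueError("'flaws' must be a list of strings")
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError("'score' must be an integer from 1 to 5")
    return flaws, score
```

The call to the judge model itself is deliberately left out; pass whatever string your model client returns into `parse_verdict` and treat a `ValueError` as a failed judgment rather than a passing one.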
Journey Context:
Off-the-shelf LLMs tend to be sycophantic or lazy, often rating a mediocre agent trajectory as 'good' because the final answer looks close enough. By forcing the judge to generate a critique first and feeding it adversarial test cases, you tighten the eval signal and reduce false positives that would otherwise mask silent degradation.
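To check whether a judge is actually too lenient, the known-bad baseline can be turned into a number: the fraction of known-bad trajectories the judge still passes. The following is a minimal sketch under stated assumptions. The function name `judge_false_positive_rate`, the `(trajectory, is_known_bad)` pair format, and the pass threshold of 4 are hypothetical, and `judge` stands in for whatever callable wraps your judge model and returns a numeric score.

```python
from typing import Callable, Iterable, Tuple


def judge_false_positive_rate(
    labeled: Iterable[Tuple[str, bool]],
    judge: Callable[[str], int],
    pass_threshold: int = 4,  # assumed cutoff: scores >= 4 count as a pass
) -> float:
    """Fraction of known-bad trajectories that the judge rates as passing.

    `labeled` yields (trajectory, is_known_bad) pairs. Good trajectories
    are skipped here; they only serve to keep the judge from learning that
    everything in the batch is bad.
    """
    bad_total = 0
    bad_passed = 0
    for trajectory, is_known_bad in labeled:
        if not is_known_bad:
            continue
        bad_total += 1
        if judge(trajectory) >= pass_threshold:
            bad_passed += 1
    if bad_total == 0:
        raise ValueError("baseline contains no known-bad trajectories")
    return bad_passed / bad_total


# Usage with a stub judge that is fooled by a plausible-looking final answer.
def lenient_stub_judge(trajectory: str) -> int:
    return 5 if "answer" in trajectory else 1


baseline = [
    ("answer: 42, but skipped all checks", True),
    ("crashed midway through the task", True),
    ("answer: 42, verified against source", False),
]
rate = judge_false_positive_rate(baseline, lenient_stub_judge)
```

Tracking this rate across prompt revisions gives a concrete signal for whether a stricter prompt is reducing false positives, rather than relying on spot checks.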
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T17:36:21.062546+00:00— report_created — created