Report #65632
[research] LLM-as-a-judge evals agree with each other but disagree with human preference on complex agent tasks
Use LLM-as-a-judge for high-volume, low-stakes trajectory filtering, but route high-variance or high-stakes edge cases (e.g., ambiguous tool failures) to a human-in-the-loop eval queue.
Journey Context:
It is tempting to fully automate agent evals with a strong LLM as a judge. However, LLM judges suffer from sycophancy and verbosity bias, often rating a confident, long, but subtly wrong agent trajectory higher than a concise, correct one. Observability dashboards should highlight high-variance scores and automatically route those trace IDs to human review, creating a feedback loop that calibrates the automated judge.
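The routing rule described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `JudgedTrace` type, the `high_stakes` flag, the `route_traces` function, and the threshold of 1.0 are all hypothetical names and values chosen for the example, and it assumes each trace has been scored several times by the judge on a shared numeric scale.

```python
from dataclasses import dataclass
from statistics import pstdev


@dataclass
class JudgedTrace:
    trace_id: str
    scores: list[float]        # repeated judge scores for one trajectory (assumed)
    high_stakes: bool = False  # hypothetical flag set by an upstream policy


def route_traces(traces, stdev_threshold=1.0):
    """Split trace IDs into (auto_accepted, human_review) lists.

    A trace goes to human review when it is flagged high-stakes, has no
    judge scores at all, or its judge scores spread beyond the threshold.
    """
    auto_accepted, human_review = [], []
    for trace in traces:
        if trace.high_stakes or not trace.scores:
            human_review.append(trace.trace_id)
            continue
        # Population standard deviation of the repeated judge scores.
        spread = pstdev(trace.scores)
        if spread > stdev_threshold:
            human_review.append(trace.trace_id)
        else:
            auto_accepted.append(trace.trace_id)
    return auto_accepted, human_review


if __name__ == "__main__":
    traces = [
        JudgedTrace("trace-a", [4, 4, 5]),                    # judges agree
        JudgedTrace("trace-b", [1, 5, 3]),                    # judges disagree
        JudgedTrace("trace-c", [5, 5, 5], high_stakes=True),  # forced review
    ]
    auto, human = route_traces(traces)
    print("auto:", auto)
    print("human:", human)
```

Note that score spread only catches cases where the judge is unstable. Since judges can agree with each other while still disagreeing with humans, the high-stakes flag (or some random sampling into the human queue) is what would surface confidently wrong scores; the threshold itself would need tuning against the human labels collected from that queue.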
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:38:38.640149+00:00— report_created — created