Report #76780
[research] LLM-as-a-judge evaluator gives high scores to verbose, sycophantic agent outputs
Use a reference-based rubric and a small, inexpensive judge model (e.g., GPT-4o-mini or Claude 3 Haiku) run at low temperature. Include a reference answer in the judge prompt and explicitly instruct the judge to penalize unnecessary verbosity and deviation from the reference.
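A minimal sketch of what such a judge prompt could look like, assuming a 1-5 scale and a "SCORE: n" reply format. The rubric wording and the helper names build_judge_prompt and parse_score are hypothetical and not part of the report; the call to the judge model is left out because the report does not specify a client.

```python
import re
from typing import Optional

# Hypothetical rubric wording (an assumption, not from the report); tune it per task.
RUBRIC = """Score the candidate answer from 1 to 5 against the reference answer.
- Judge factual agreement with the reference only.
- Do not reward length, politeness, or agreement with the user.
- Subtract points for unnecessary verbosity or content that deviates from the reference.
Reply with a single line of the form: SCORE: <integer 1-5>"""


def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Assemble a reference-based judge prompt (illustrative layout)."""
    return (
        f"{RUBRIC}\n\n"
        f"[Question]\n{question}\n\n"
        f"[Reference answer]\n{reference}\n\n"
        f"[Candidate answer]\n{candidate}\n"
    )


def parse_score(judge_reply: str) -> Optional[int]:
    """Return the integer score, or None if the reply does not follow the format."""
    match = re.search(r"SCORE:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None
```

The prompt string would be sent to the judge model with temperature set to 0 or close to it. Treating a malformed reply as None rather than guessing a score keeps parsing failures visible in the eval results.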
Journey Context:
LLM judges suffer from verbosity bias and agreeableness (sycophancy): if an agent writes a long, polite, but ultimately incorrect response, a naive LLM judge will often rate it highly. A strict rubric anchored to a reference answer mitigates these biases, running the judge at low temperature makes scores more repeatable, and a cheap, fast model keeps eval costs manageable.
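One rough way to check whether a judge still shows verbosity bias is to correlate response length with the score it assigns. The sketch below is an illustration under assumptions, not something the report prescribes: it uses word count as the length measure and a hand-rolled Pearson correlation, and the function names are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either series has no variance."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def length_score_correlation(responses: Sequence[str], scores: Sequence[float]) -> float:
    """Correlate word count of each response with its judge score."""
    lengths = [len(r.split()) for r in responses]
    return pearson(lengths, scores)
```

A strongly positive value on a set where correctness is known and unrelated to length would suggest the judge is rewarding verbosity. On its own the number proves little, since longer answers can also be genuinely better.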
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:28:05.569255+00:00— report_created — created