Report #13543
[research] Using an LLM to evaluate agent outputs (LLM-as-a-judge) results in false positives where verbose, overly apologetic outputs are scored higher than concise, correct ones
Constrain the judge LLM with a strict rubric and use pairwise comparison (reference vs candidate) rather than absolute scoring. Include negative constraints like 'Penalize any output exceeding 3 sentences.'
Journey Context:
LLM judges tend to favor outputs that resemble their own training data (helpful, detailed, polite). An agent that returns a 500-word essay explaining a simple True/False answer will often score higher on a generic rubric than one that just returns True. Absolute scoring amplifies this, because the judge rates each output in isolation with nothing to anchor the score against. Pairwise comparison forces the judge to choose which of two outputs is better, and explicit length and format constraints in the rubric counteract the verbosity bias.
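
A minimal sketch of the pairwise setup described above, in Python. Everything here is illustrative rather than a tested workaround: the rubric wording, the function names (build_pairwise_prompt, parse_verdict, judge_pair), and the call_judge callable are assumptions. call_judge stands in for whatever client sends a prompt to the judge model and returns its raw text reply.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric: correctness is the only positive criterion, and the
# negative constraints from the report are stated explicitly.
RUBRIC = (
    "You are comparing two answers to the same task.\n"
    "1. Correctness is the only positive criterion.\n"
    "2. Penalize any output exceeding 3 sentences.\n"
    "3. Do not reward apologies, politeness, or unrequested explanation.\n"
    "Reply with exactly one letter: A or B."
)


def build_pairwise_prompt(task: str, reference: str, candidate: str) -> str:
    """Assemble one prompt that shows the reference (A) and the candidate (B)."""
    return (
        f"{RUBRIC}\n\n"
        f"Task:\n{task}\n\n"
        f"Answer A:\n{reference}\n\n"
        f"Answer B:\n{candidate}\n\n"
        "Better answer:"
    )


def parse_verdict(raw: str) -> Optional[str]:
    """Accept only a bare 'A' or 'B' (optionally followed by a period)."""
    match = re.fullmatch(r"\s*([AB])\s*\.?\s*", raw)
    return match.group(1) if match else None


def judge_pair(
    task: str,
    reference: str,
    candidate: str,
    call_judge: Callable[[str], str],  # assumed: prompt in, raw reply out
) -> str:
    """Return 'A' if the judge prefers the reference, 'B' for the candidate."""
    reply = call_judge(build_pairwise_prompt(task, reference, candidate))
    verdict = parse_verdict(reply)
    if verdict is None:
        # Treat anything other than a bare letter as a rubric violation
        # instead of guessing what the judge meant.
        raise ValueError(f"Judge reply did not follow the rubric: {reply!r}")
    return verdict
```

Restricting the reply to a single letter means the judge cannot hedge with a numeric score or a paragraph of commentary, and a reply that ignores the format is rejected rather than interpreted.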
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T19:07:38.134908+00:00— report_created — created