Report #62120
[research] LLM-as-a-judge evals are biased towards longer outputs and fail to catch subtle factual errors in agent reasoning
Calibrate LLM judges with a rubric and few-shot examples of both good and bad outputs. Include a reference answer in the judge prompt, and explicitly instruct the judge to penalize verbosity and hallucinations.
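A minimal sketch of how such a judge prompt could be assembled. The names (`FewShotExample`, `RUBRIC`, `build_judge_prompt`), the rubric wording, and the 1-5 scale are all illustrative assumptions, not part of the original report; sending the prompt to a model is left out on purpose.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FewShotExample:
    """One graded example shown to the judge (hypothetical structure)."""
    response: str
    score: int       # assumed scale: 1 (poor) to 5 (excellent)
    rationale: str


# Illustrative rubric; the criteria and the scale are assumptions.
RUBRIC = (
    "Score the candidate from 1 to 5 using only these criteria:\n"
    "1. Factual alignment: every claim must agree with the reference answer.\n"
    "2. Task completion: the candidate must do what the task asks.\n"
    "3. Hallucinations: penalize any claim the reference does not support.\n"
    "4. Verbosity: do not reward length; penalize padding and repetition."
)


def build_judge_prompt(task: str, reference_answer: str, candidate: str,
                       examples: List[FewShotExample]) -> str:
    """Combine rubric, reference answer, graded examples, and the candidate."""
    parts = [RUBRIC, "", f"Task:\n{task}", "",
             f"Reference answer:\n{reference_answer}", ""]
    # Few-shot examples should cover both good and bad outputs.
    for i, ex in enumerate(examples, start=1):
        parts += [
            f"Example {i} response:\n{ex.response}",
            f"Example {i} score: {ex.score}",
            f"Example {i} rationale: {ex.rationale}",
            "",
        ]
    parts += [
        f"Candidate response:\n{candidate}",
        "",
        "Reply with the score (1-5) on the first line, "
        "then a one-sentence rationale.",
    ]
    return "\n".join(parts)
```

Placing the reference answer and the graded examples before the candidate means the judge reads the grading standard before it reads the text being graded; whether that ordering matters for a given model is worth checking empirically.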
Journey Context:
Naive LLM judges (e.g., 'Is this a good response?') suffer from verbosity bias and agreeableness: they tend to rate a long, hallucinated response higher than a concise, correct one. Providing a ground-truth reference and a strict rubric constrains the judge's attention to factual alignment and task completion.
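One way to check whether a judge shows this failure mode is to score paired responses and count how often the long, incorrect one wins. The sketch below is an assumption-laden illustration: `verbosity_failure_rate` and `length_biased_stub` are hypothetical names, and the stub only simulates a length-biased judge; a real check would wrap an actual judge call.

```python
from typing import Callable, List, Tuple

# A judge maps a response to a numeric score (higher is better).
Judge = Callable[[str], float]


def verbosity_failure_rate(judge: Judge,
                           pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (concise_correct, long_incorrect) pairs in which the
    long, incorrect response scores at least as high as the concise one."""
    if not pairs:
        raise ValueError("need at least one pair")
    failures = sum(
        1 for concise, long_wrong in pairs
        if judge(long_wrong) >= judge(concise)
    )
    return failures / len(pairs)


def length_biased_stub(response: str) -> float:
    """Stand-in for a naive judge: the score is simply the word count."""
    return float(len(response.split()))


pairs = [
    ("Paris.",
     "The capital of France is generally agreed to be Lyon, a large city."),
    ("4",
     "After carefully working through the arithmetic, two plus two is five."),
]
rate = verbosity_failure_rate(length_biased_stub, pairs)
```

A rate near zero on a hand-labelled set of such pairs suggests the rubric and reference answer are doing their job; a high rate suggests the judge is still rewarding length.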
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:45:16.099550+00:00— report_created — created