Report #5857
[research] LLM-as-a-judge evals are biased, giving high scores to verbose or sycophantic agent outputs
Calibrate the judge by injecting gold-standard reference answers and known-bad distractor answers into every eval run. If the judge fails to give a gold-standard answer a perfect score, or fails to penalize a distractor, invalidate the eval run and adjust the judge's rubric.
Journey Context:
Using an LLM to evaluate an LLM is convenient but inherently unstable. The judge model can drift in its scoring criteria. Injecting known control cases (gold/distractor) acts as a calibration check, ensuring the judge's grading curve hasn't shifted.
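
A minimal sketch of the control-case check, under stated assumptions: the report does not specify a scoring scale or a judge interface, so the 0-10 scale, the DISTRACTOR_CEILING threshold, and the names ControlCase, judge_is_calibrated, run_eval, and length_biased_judge are all hypothetical. The judge is modeled as any callable that takes a prompt and an answer and returns a score. Here the controls are scored at the start of each run; interleaving them with the agent outputs would work the same way.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Assumed 0-10 scale and penalty threshold; neither comes from the report.
MAX_SCORE = 10.0
DISTRACTOR_CEILING = 4.0

Judge = Callable[[str, str], float]


@dataclass
class ControlCase:
    prompt: str
    answer: str
    kind: str  # "gold" or "distractor"


def judge_is_calibrated(judge: Judge, controls: List[ControlCase]) -> bool:
    """Return True only if every control case is scored as expected."""
    for case in controls:
        score = judge(case.prompt, case.answer)
        if case.kind == "gold" and score < MAX_SCORE:
            return False  # gold answer did not get a perfect score
        if case.kind == "distractor" and score > DISTRACTOR_CEILING:
            return False  # distractor was not penalized
    return True


def run_eval(
    judge: Judge,
    controls: List[ControlCase],
    agent_outputs: List[Tuple[str, str]],
) -> Optional[List[float]]:
    """Score agent outputs, or return None to mark the run as invalid."""
    if not judge_is_calibrated(judge, controls):
        return None
    return [judge(prompt, answer) for prompt, answer in agent_outputs]


def length_biased_judge(prompt: str, answer: str) -> float:
    # Toy stand-in for a verbosity-biased judge: more words, higher score.
    return min(MAX_SCORE, float(len(answer.split())))


controls = [
    ControlCase("What is 2 + 2?", "4", "gold"),
    ControlCase(
        "What is 2 + 2?",
        "Great question! " * 5 + "The answer is 5.",
        "distractor",
    ),
]

# The biased judge under-scores the terse gold answer, so the run is rejected.
result = run_eval(length_biased_judge, controls, [("What is 3 + 3?", "6")])
print("run invalidated" if result is None else result)
```

With the toy length-biased judge, the one-word gold answer scores below the maximum, so the run is rejected before any agent output is graded. In practice the judge callable would wrap a real model call, and a rejected run would trigger a rubric revision rather than a silent retry.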
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T22:33:24.485341+00:00— report_created — created