Report #17132
[research] LLM-as-a-judge evaluator gives falsely high scores, masking agent regressions
Calibrate the judge model against a golden dataset of human-rated examples before deploying it to CI. Implement a 'judge bias' eval where the judge scores intentionally bad outputs to ensure it can actually fail them.
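A minimal sketch of the calibration step, assuming the judge is a callable that maps an output string to a numeric score on the same scale as the human ratings. The names `GoldenExample`, `calibration_report`, and `is_calibrated`, the 1-5 scale, and the thresholds are all illustrative assumptions, not part of any existing tool.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass(frozen=True)
class GoldenExample:
    output: str       # agent output that humans have already rated
    human_score: int  # human rating, assumed here to be on a 1-5 scale


def calibration_report(
    judge: Callable[[str], float],
    golden: List[GoldenExample],
    tolerance: float = 1.0,
) -> Dict[str, float]:
    """Compare judge scores with human scores on the golden dataset."""
    diffs = [judge(ex.output) - ex.human_score for ex in golden]
    return {
        # Positive values mean the judge scores higher than humans (inflation).
        "mean_signed_error": mean(diffs),
        "mean_abs_error": mean(abs(d) for d in diffs),
        # Fraction of examples where the judge is within `tolerance` of the human.
        "within_tolerance": sum(abs(d) <= tolerance for d in diffs) / len(diffs),
    }


def is_calibrated(
    report: Dict[str, float],
    max_inflation: float = 0.5,   # hypothetical threshold
    min_agreement: float = 0.8,   # hypothetical threshold
) -> bool:
    """Gate for CI: reject judges that inflate or disagree with humans too often."""
    return (
        report["mean_signed_error"] <= max_inflation
        and report["within_tolerance"] >= min_agreement
    )
```

In CI, the gate could run before any agent eval: if `is_calibrated` returns False for the current judge prompt or model version, the judge's scores are not trusted for that run.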
Journey Context:
It is tempting to use GPT-4 to grade GPT-4, but LLM judges suffer from verbosity bias (favoring longer answers) and sycophancy (agreeing with the agent's reasoning). Without calibration against human ground truth, the judge's scores can drift upward unnoticed. Adding a bias check (can the judge spot a deliberately terrible response?) ensures the evaluator remains a true discriminator rather than a rubber stamp.
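The bias check described above can be sketched as follows. This assumes the judge is a callable returning a numeric score; `false_passes`, `verbosity_gain`, the pass threshold of 3.0, and both stub judges are hypothetical names and values used only for illustration. A real judge would wrap a model call.

```python
from typing import Callable, List

Judge = Callable[[str], float]


def false_passes(
    judge: Judge, bad_outputs: List[str], pass_threshold: float = 3.0
) -> List[str]:
    """Return the deliberately bad outputs that the judge wrongly passed."""
    return [out for out in bad_outputs if judge(out) >= pass_threshold]


def verbosity_gain(judge: Judge, answer: str, filler: str, repeats: int = 5) -> float:
    """Score change when the same answer is padded with content-free filler.

    A clearly positive value suggests the judge rewards length, not quality.
    """
    padded = answer + (" " + filler) * repeats
    return judge(padded) - judge(answer)


# Stub judges for demonstration only; neither calls a real model.
def length_biased_judge(output: str) -> float:
    # Rewards length alone, capped at 5.0.
    return min(5.0, 1.0 + len(output) / 50)


def strict_stub_judge(output: str) -> float:
    # Fails everything; stands in for a judge that rejects bad outputs.
    return 1.0


if __name__ == "__main__":
    bad = ["I don't know.", "x" * 300]
    print(false_passes(length_biased_judge, bad))
    print(verbosity_gain(length_biased_judge, "wrong", "To elaborate further,"))
```

A CI step could fail the build whenever `false_passes` returns a non-empty list, so a judge that has become a rubber stamp is caught before its scores are used to assess the agent.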
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T04:39:39.148795+00:00— report_created — created