Report #39813
[research] LLM-as-a-judge evals can favor their own outputs or reward verbosity, producing false positives on agent traces
Calibrate the judge model by building a golden dataset of intentionally flawed agent traces (e.g., partial completions, wrong tool calls) and confirming that the judge scores them as failures before running it on real data.
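A minimal sketch of how such a golden dataset might be derived from one known-good trace. The trace shape (a dict with "task" and "steps"), the tool names, and the helper names truncate, swap_tool, and build_golden_set are all hypothetical, chosen only for illustration; adapt them to whatever trace format your agent framework emits.

```python
from copy import deepcopy

# Hypothetical trace shape: a task plus an ordered list of tool-call steps.
GOOD_TRACE = {
    "task": "Refund order 1234",
    "steps": [
        {"tool": "lookup_order", "args": {"order_id": "1234"}},
        {"tool": "issue_refund", "args": {"order_id": "1234"}},
        {"tool": "send_confirmation", "args": {"order_id": "1234"}},
    ],
}


def truncate(trace, keep):
    """Partial completion: keep only the first `keep` steps."""
    flawed = deepcopy(trace)
    flawed["steps"] = flawed["steps"][:keep]
    return flawed


def swap_tool(trace, index, wrong_tool):
    """Wrong tool call: replace the tool name at position `index`."""
    flawed = deepcopy(trace)
    flawed["steps"][index]["tool"] = wrong_tool
    return flawed


def build_golden_set(trace):
    """Return (trace, expected_verdict, label) tuples for judge calibration."""
    return [
        (trace, "pass", "reference"),
        (truncate(trace, 1), "fail", "partial_completion"),
        (swap_tool(trace, 1, "cancel_order"), "fail", "wrong_tool_call"),
    ]


golden = build_golden_set(GOOD_TRACE)
for _, expected, label in golden:
    print(f"{label}: expected verdict = {expected}")
```

Deriving each flawed trace from a passing reference keeps the task and wording identical, so the only difference the judge can react to is the injected execution error.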
Journey Context:
Using a powerful LLM to evaluate agent traces often yields inflated scores, because the judge model infers the intended outcome and credits it even when the execution failed. By injecting known failure modes (adversarial evals) into your regression suite, you can tune the judge's prompt to be strict about execution correctness, which mitigates the leniency bias inherent in LLM-as-a-judge setups.
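The calibration check described here could be sketched as a gate that measures how often the judge passes known-bad traces. Everything below is an assumption for illustration: the judge is modeled as a plain callable returning "pass" or "fail", and lenient_judge and strict_judge are deterministic stand-ins rather than real LLM calls. In practice the callable would wrap your judge prompt and model.

```python
# Hypothetical golden set: (trace, expected_verdict) pairs.
EXPECTED_TOOLS = ["lookup_order", "issue_refund"]
GOLDEN = [
    ({"expected_tools": EXPECTED_TOOLS,
      "steps": [{"tool": "lookup_order"}, {"tool": "issue_refund"}]}, "pass"),
    # Partial completion: the refund is never issued.
    ({"expected_tools": EXPECTED_TOOLS,
      "steps": [{"tool": "lookup_order"}]}, "fail"),
    # Wrong tool call: the order is cancelled instead of refunded.
    ({"expected_tools": EXPECTED_TOOLS,
      "steps": [{"tool": "lookup_order"}, {"tool": "cancel_order"}]}, "fail"),
]


def false_pass_rate(judge, golden):
    """Fraction of known-bad traces that the judge scores as passing."""
    bad = [trace for trace, expected in golden if expected == "fail"]
    if not bad:
        raise ValueError("golden set contains no failing traces")
    missed = sum(1 for trace in bad if judge(trace) == "pass")
    return missed / len(bad)


def lenient_judge(trace):
    # Stand-in for leniency bias: passes any trace that attempted the task.
    return "pass" if trace["steps"] else "fail"


def strict_judge(trace):
    # Stand-in for a judge tuned toward execution correctness:
    # the executed tool sequence must match the expected one exactly.
    executed = [step["tool"] for step in trace["steps"]]
    return "pass" if executed == trace["expected_tools"] else "fail"


for name, judge in [("lenient", lenient_judge), ("strict", strict_judge)]:
    rate = false_pass_rate(judge, GOLDEN)
    status = "OK to run on real data" if rate == 0.0 else "needs prompt tuning"
    print(f"{name}: false-pass rate {rate:.2f} ({status})")
```

A zero threshold is used here only because the stand-in judges are deterministic. With a real LLM judge, a small tolerance and repeated runs per trace may be more realistic, and it is worth also checking that the tuned judge still passes the reference trace so strictness does not turn into false failures.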
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:17:53.307152+00:00— report_created — created