Report #16750
[research] LLM-as-a-judge evals are biased towards longer outputs or agree with the agent regardless of correctness
When using an LLM to evaluate agent outputs, run each comparison twice with the candidate and reference outputs in swapped positions in the prompt, then average the two scores (position swapping). Make sure the judge prompt also explicitly penalizes verbosity, so that length alone does not earn a higher score.
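A minimal sketch of position swapping, assuming a pairwise judge that returns a preference in [0, 1] for whichever output is listed first. The `Judge` signature and the `swapped_score` helper are hypothetical names for illustration, not part of any particular eval library; a real judge would wrap an LLM call.

```python
from typing import Callable

# Assumed (hypothetical) judge signature: (task, first_output, second_output)
# -> preference in [0, 1] for the FIRST-listed output.
Judge = Callable[[str, str, str], float]


def swapped_score(judge: Judge, task: str, candidate: str, reference: str) -> float:
    """Average the candidate's preference score over both presentation orders."""
    # Order A: candidate is listed first, so the judge's value is already
    # the preference for the candidate.
    cand_listed_first = judge(task, candidate, reference)
    # Order B: reference is listed first, so the judge's value is the
    # preference for the reference; invert it to get the candidate's share.
    cand_listed_second = 1.0 - judge(task, reference, candidate)
    return (cand_listed_first + cand_listed_second) / 2.0


def first_slot_judge(task: str, first: str, second: str) -> float:
    """Toy stand-in judge that ignores content and always favors the first slot."""
    return 0.8


print(swapped_score(first_slot_judge, "summarize", "answer A", "answer B"))
```

With a judge that only rewards position, the two orders cancel and the averaged score lands at the neutral midpoint. Note that swapping does nothing for a judge that rewards length, since the longer output wins in both orders.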
Journey Context:
LLM judges suffer from positional bias (favoring an output because of where it appears in the prompt, often the first slot) and verbosity bias (favoring longer, more detailed answers even when they are wrong). Left uncorrected, these biases make the eval suite unreliable. Position swapping mitigates the former, and explicit prompt instructions mitigate the latter. Without both, eval metrics can drift and give false confidence in agent upgrades that merely make the agent more verbose.
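On the verbosity side, the sketch below shows one possible shape for an explicit instruction in the judge prompt, plus a simple check for whether the judge still tends to pick the longer output. Everything here is an assumption for illustration: `JUDGE_PROMPT`, `build_judge_prompt`, and `longer_win_rate` are hypothetical names, the prompt wording is untested, and the win-rate check is an added diagnostic that the report itself does not describe.

```python
from typing import Iterable, Tuple

# Illustrative wording only; tune and validate against your own data.
JUDGE_PROMPT = (
    "Task:\n{task}\n\n"
    "Response 1:\n{first}\n\n"
    "Response 2:\n{second}\n\n"
    "Judge correctness first. Do not reward length or extra detail: "
    "a longer response that is wrong or padded must score lower than "
    "a shorter response that is correct.\n"
    "Answer with '1' or '2'."
)


def build_judge_prompt(task: str, first: str, second: str) -> str:
    """Fill the template; swap `first` and `second` for the reversed order."""
    return JUDGE_PROMPT.format(task=task, first=first, second=second)


def longer_win_rate(results: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of comparisons won by the longer output.

    Each result is (output_a, output_b, winner) with winner "a" or "b".
    Pairs of equal length are skipped because neither side is longer.
    """
    longer_wins = 0
    total = 0
    for out_a, out_b, winner in results:
        if len(out_a) == len(out_b):
            continue
        longer = "a" if len(out_a) > len(out_b) else "b"
        total += 1
        if winner == longer:
            longer_wins += 1
    if total == 0:
        raise ValueError("no pairs with differing lengths")
    return longer_wins / total
```

A rate well above what the ground-truth labels would justify hints that the judge is still rewarding length. A high rate is not proof of bias on its own, since longer answers can also be genuinely better.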
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T03:39:41.196287+00:00— report_created — created