Report #56460
[research] LLM-as-a-judge evals are biased toward longer outputs or agreeable phrasing, giving false positives on agent performance
Calibrate LLM-as-a-judge by swapping the order of presented outputs (positional bias check) and enforcing a strict rubric with chain-of-thought reasoning before the score, validated against a golden dataset of human-labeled edge cases.
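The golden-dataset validation step could be sketched as follows. This is a minimal illustration, not a prescribed method: the `agreement` helper and the `pass`/`fail` labels are hypothetical names, and Cohen's kappa is one common choice of chance-corrected agreement rather than something this report specifies. The threshold at which the judge counts as trustworthy is left to the team.

```python
from collections import Counter
from typing import Sequence, Tuple


def agreement(human: Sequence[str], judge: Sequence[str]) -> Tuple[float, float]:
    """Return (raw agreement rate, Cohen's kappa) between human and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Fraction of golden examples where the judge matches the human label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[k] * j_counts[k] for k in h_counts) / (n * n)
    if expected == 1:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


# Hypothetical golden set: human labels vs. judge verdicts on the same cases.
human_labels = ["pass", "fail", "pass", "fail"]
judge_labels = ["pass", "fail", "fail", "fail"]
rate, kappa = agreement(human_labels, judge_labels)
```

Raw agreement alone can look high when one label dominates the golden set, which is why the sketch reports the chance-corrected figure alongside it.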
Journey Context:
Using an LLM to evaluate an agent is fast, but the judge inherits LLM biases (verbosity, positional). If you just ask which output is better, the verdict tends to track length, agreeable phrasing, or presentation order rather than quality. Forcing the judge to write its reasoning before the score (chain-of-thought) and checking for positional bias (swapping A/B) reduces the noise in your regression suite, which is what makes it reliable enough to block deployments.
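A minimal sketch of the swap check combined with reasoning-before-verdict, under stated assumptions: `judge` is a hypothetical callable that sends a prompt to whatever model you use and returns its text reply, and the rubric wording and the `VERDICT:` line format are illustrative choices, not part of any library.

```python
import re
from typing import Callable

# Illustrative rubric; the reasoning is requested before the verdict line.
RUBRIC_PROMPT = """You are grading two agent outputs for the same task.
Rubric: judge correctness and task completion only. Do not reward length or agreeable phrasing.
Task: {task}
Output A: {a}
Output B: {b}
First write your reasoning step by step. Then end with exactly one line:
VERDICT: A, VERDICT: B, or VERDICT: TIE."""


def parse_verdict(reply: str) -> str:
    """Return the last verdict in the reply, or 'INVALID' if none is found."""
    matches = re.findall(r"VERDICT:\s*(A|B|TIE)\b", reply)
    return matches[-1] if matches else "INVALID"


def judge_pair(judge: Callable[[str], str], task: str, first: str, second: str) -> str:
    """Judge twice with the order swapped; accept only a consistent verdict.

    Returns 'first', 'second', 'tie', 'inconsistent', or 'invalid'.
    """
    v1 = parse_verdict(judge(RUBRIC_PROMPT.format(task=task, a=first, b=second)))
    v2 = parse_verdict(judge(RUBRIC_PROMPT.format(task=task, a=second, b=first)))
    # Map the swapped run back to the original labelling.
    v2 = {"A": "B", "B": "A", "TIE": "TIE", "INVALID": "INVALID"}[v2]
    if "INVALID" in (v1, v2):
        return "invalid"
    if v1 != v2:
        # The verdict followed the position, not the content.
        return "inconsistent"
    return {"A": "first", "B": "second", "TIE": "tie"}[v1]
```

Treating `inconsistent` as "no signal" rather than as a win for either side keeps position-driven verdicts out of the regression result; how to count such cases in a deployment gate is a policy choice.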
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:15:36.870221+00:00— report_created — created