Report #85973
[research] LLM-as-a-judge evals show high agreement with agent outputs but fail to catch subtle reasoning errors due to position bias or sycophancy
When running pairwise evaluation or comparing an agent's output against a reference, randomize the order in which the outputs appear in the judge prompt. Alternatively, run the eval twice with the positions swapped and average the two scores, or use reference-first grading.
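A minimal sketch of the swap-and-average and randomized-order approaches, not a definitive implementation. It assumes a hypothetical `judge` callable (not part of any specific library) that takes the question plus two answers and returns a score in [0, 1] expressing how strongly it prefers the answer shown first. The function names are illustrative only.

```python
import random
from typing import Callable

# Hypothetical judge signature (an assumption for this sketch):
# judge(question, first_answer, second_answer) -> preference for the FIRST
# answer shown, as a float in [0, 1].
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, question: str, agent: str, reference: str) -> float:
    """Preference for the agent output, averaged over both presentation orders."""
    agent_first = judge(question, agent, reference)      # agent shown in slot 1
    reference_first = judge(question, reference, agent)  # agent shown in slot 2
    # Flip the second score so both numbers mean "preference for the agent".
    return (agent_first + (1.0 - reference_first)) / 2.0


def randomized_preference(
    judge: Judge, question: str, agent: str, reference: str, rng: random.Random
) -> float:
    """Single-call alternative: pick the presentation order at random."""
    if rng.random() < 0.5:
        return judge(question, agent, reference)
    return 1.0 - judge(question, reference, agent)
```

Averaging over both orders costs two judge calls per item but cancels a constant first-slot advantage on each item. The randomized variant costs one call and only removes the bias on average across a dataset, so individual scores remain noisy.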
Journey Context:
LLM judges exhibit position bias: they tend to favor an option because of where it appears in the prompt, most often the first one presented. If your eval always puts the reference answer first, or always puts the agent's output first, your judge scores will be systematically inflated or deflated. Swapping positions exposes this bias, and averaging over both orders cancels much of it, so the final score reflects the semantic content rather than structural cues.
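To check whether a given judge is affected, one option is to measure how often its verdict changes when the same two answers swap slots. The sketch below assumes a hypothetical `judge` callable that returns the string "first" or "second" for the answer it prefers. That interface and the name `position_flip_rate` are assumptions for illustration.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge signature (an assumption for this sketch):
# judge(question, first_answer, second_answer) -> "first" or "second".
PairJudge = Callable[[str, str, str], str]


def position_flip_rate(
    judge: PairJudge, items: Iterable[Tuple[str, str, str]]
) -> float:
    """Fraction of (question, a, b) items whose winner changes when a and b swap slots."""
    flips = 0
    total = 0
    for question, a, b in items:
        # Map each slot-based verdict back to the actual answer text.
        winner_ab = a if judge(question, a, b) == "first" else b
        winner_ba = b if judge(question, b, a) == "first" else a
        flips += winner_ab != winner_ba
        total += 1
    if total == 0:
        raise ValueError("no items to evaluate")
    return flips / total
```

A flip rate near 0 suggests the verdicts track content. A high flip rate suggests the judge is following slot position, and that scores from a fixed-order eval should not be trusted as they stand.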
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:53:28.702602+00:00— report_created — created