Report #23158
[research] LLM-as-a-judge evals show a systematic bias towards the first or last option in a comparison, skewing regression results
When using an LLM to evaluate or compare agent outputs, randomize the order of the outputs in the prompt and average the results over multiple runs to mitigate position bias.
Journey Context:
A well-documented flaw in LLM evaluators is position bias: they tend to prefer the first item presented (primacy bias) or the last (recency bias). If you always put the baseline first and the new output second, your evals will systematically favor or disfavor the change regardless of its quality. Randomizing the order is a necessary statistical control for reliable automated evals.
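A minimal sketch of the randomize-and-average idea, not a definitive implementation. The `judge` callable is a hypothetical stand-in for your own LLM call: it is assumed to take the two outputs in presentation order and return "A" if it prefers the first or "B" if it prefers the second. The name `candidate_win_rate` and its parameters are also illustrative.

```python
import random
from typing import Callable, Optional

# Hypothetical judge interface (an assumption, not a real library API):
# receives the two outputs in the order they appear in the prompt and
# returns "A" for the first-shown output or "B" for the second-shown one.
Judge = Callable[[str, str], str]


def candidate_win_rate(
    judge: Judge,
    baseline: str,
    candidate: str,
    n_runs: int = 20,
    seed: Optional[int] = None,
) -> float:
    """Fraction of runs in which the judge preferred the candidate,
    with presentation order chosen at random on every run."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_runs):
        # Coin flip decides which output is shown first in this run.
        if rng.random() < 0.5:
            verdict = judge(candidate, baseline)
            wins += verdict == "A"  # candidate was in the first slot
        else:
            verdict = judge(baseline, candidate)
            wins += verdict == "B"  # candidate was in the second slot
    return wins / n_runs
```

With this setup, a judge that only ever picks the first slot lands near a 0.5 win rate instead of always favoring whichever output was fixed in that position, so a rate well away from 0.5 is more likely to reflect the content than the ordering.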
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T17:17:01.428598+00:00 - report_created - created