Report #6783
[research] LLM-as-a-judge evals showing high variance due to prompt ordering or positional bias
When using an LLM to evaluate agent trajectories, randomize the order in which the reference and candidate trajectories are presented across runs, or run a pairwise evaluation in both orders (position swapping). Then average the scores across orderings.
Journey Context:
LLM judges are sensitive to the order in which options are presented. If the 'expected' trajectory always appears first, the judge may favor it because of its position rather than its quality, which gives false confidence in your evals. Swapping positions and averaging mitigates this bias and gives a more stable signal on whether your agent's new trajectory is actually better.
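A minimal sketch of position swapping, under stated assumptions: the judge is modeled as a hypothetical callable `judge(first, second)` that returns a score in [0, 1] for how strongly it prefers the first-presented trajectory. `toy_judge` is a stand-in with a deliberate first-position bonus, not a real LLM call, and all function names here are illustrative.

```python
from typing import Callable

# Hypothetical judge interface (an assumption, not a real API):
# returns a score in [0, 1] for "the first-presented trajectory is better".
Judge = Callable[[str, str], float]


def debiased_preference(judge: Judge, reference: str, candidate: str) -> float:
    """Preference for the candidate, averaged over both presentation orders."""
    # Order A: candidate shown first, so the score already means "candidate wins".
    cand_first = judge(candidate, reference)
    # Order B: candidate shown second, so flip the score.
    cand_second = 1.0 - judge(reference, candidate)
    return (cand_first + cand_second) / 2.0


def order_gap(judge: Judge, reference: str, candidate: str) -> float:
    """Difference between the two orderings; near 0 means little order sensitivity."""
    return judge(candidate, reference) - (1.0 - judge(reference, candidate))


def toy_judge(first: str, second: str) -> float:
    """Stand-in for an LLM call: prefers the longer text, plus a first-position bonus."""
    if len(first) > len(second):
        true_pref = 0.7
    elif len(first) < len(second):
        true_pref = 0.3
    else:
        true_pref = 0.5
    return min(1.0, true_pref + 0.2)  # +0.2 simulates positional bias


if __name__ == "__main__":
    ref, cand = "ab", "abcd"
    print("candidate first only:", toy_judge(cand, ref))
    print("swapped and averaged:", debiased_preference(toy_judge, ref, cand))
    print("order gap:", order_gap(toy_judge, ref, cand))
```

With a real judge, a large `order_gap` on many pairs is itself a useful diagnostic: it suggests the single-order scores should not be trusted.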
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T01:05:39.390283+00:00— report_created — created