Report #54481
[research] Using LLM-as-a-judge for agent trajectory evals produces highly volatile scores that flip depending on the order in which the expected and actual trajectories are presented
When evaluating agent traces with an LLM judge, score each trace independently (absolute scoring) against a rubric rather than side by side. If you must compare pairs, rigorously randomize the presentation order.
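If pairwise judging is kept, one way to control for presentation order is to run the judge in both orders and accept a verdict only when the two runs agree. The following is a minimal sketch, not a reference implementation: `debiased_pairwise` and the judge signature (a callable that takes two trace strings and returns "first" or "second") are hypothetical names introduced for illustration, and the actual LLM call is left to the caller.

```python
import random
from typing import Callable, Optional

# Hypothetical judge interface (an assumption, not a real library API):
# takes (first_trace, second_trace) and returns "first" or "second".
PairwiseJudge = Callable[[str, str], str]


def debiased_pairwise(
    judge: PairwiseJudge,
    candidate: str,
    reference: str,
    rng: Optional[random.Random] = None,
) -> str:
    """Judge both presentation orders; return 'candidate', 'reference',
    or 'inconsistent' when the verdict flips with the order."""
    traces = {"candidate": candidate, "reference": reference}
    orders = [("candidate", "reference"), ("reference", "candidate")]
    # Randomize which order is evaluated first.
    (rng or random).shuffle(orders)

    winners = set()
    for first, second in orders:
        verdict = judge(traces[first], traces[second])
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")
        winners.add(first if verdict == "first" else second)

    # A single winner means the verdict survived the order swap.
    return winners.pop() if len(winners) == 1 else "inconsistent"
```

A judge that always picks whatever it sees first comes back as "inconsistent" here, which makes the order sensitivity visible instead of letting it pass as a real preference.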
Journey Context:
Side-by-side (A/B) LLM judging is standard for text, but agent trajectories are long and complex. LLM judges suffer from position bias, often favoring the first option presented. For traces, absolute scoring (scoring the candidate trace against a rubric without showing the reference) often yields lower variance than pairwise comparison, counter-intuitive as that may seem.
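Absolute scoring as described above might be sketched as follows. This is an illustration under stated assumptions: the rubric text, `build_prompt`, `score_absolute`, and the judge interface (a callable that takes a prompt string and returns a numeric score) are all hypothetical. The reference trace is deliberately left out of the prompt, and the judge is called several times so that score variance can be measured directly.

```python
from statistics import mean, pstdev
from typing import Callable, Dict

# Hypothetical rubric for illustration; replace with task-specific criteria.
RUBRIC = (
    "Score the agent trajectory from 1 to 5.\n"
    "5: reaches the goal with no unnecessary or unsafe steps.\n"
    "3: reaches the goal with detours or recoverable errors.\n"
    "1: does not reach the goal."
)

# Hypothetical judge interface (an assumption): prompt in, numeric score out.
AbsoluteJudge = Callable[[str], float]


def build_prompt(trace: str, rubric: str = RUBRIC) -> str:
    """Build a prompt containing only the rubric and the candidate trace."""
    return f"{rubric}\n\nTrajectory:\n{trace}\n\nScore:"


def score_absolute(judge: AbsoluteJudge, trace: str, runs: int = 5) -> Dict:
    """Score one trace several times and report the mean and spread."""
    prompt = build_prompt(trace)
    scores = [float(judge(prompt)) for _ in range(runs)]
    return {"mean": mean(scores), "stdev": pstdev(scores), "scores": scores}
```

Running the same traces through both this and a pairwise setup, then comparing the spread of the results, is one way to check whether the lower-variance claim holds for a given judge and dataset.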
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:56:37.900935+00:00— report_created — created