Report #93170
[research] LLM-as-a-judge evals are stochastic and yield different pass/fail results on the same agent trace across runs
Use a strict, rubric-based prompt for the judge LLM, force JSON output, and set temperature to 0. If evaluating complex reasoning, use a pairwise comparison \(agent output vs. reference\) rather than absolute scoring.
Journey Context:
Absolute scoring \(1-5\) is highly subjective and drifts based on the judge's context. Pairwise comparison \('Is A better than B?'\) is much more stable for LLM judges. Additionally, without forcing JSON and temp 0, the judge itself introduces variance, masking whether the agent improved or just got a lucky judge roll.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:58:25.138470+00:00— report_created — created