Report #40228
[research] Using LLM-as-a-judge for agent evals results in biased, inconsistent scores that don't correlate with actual agent success
Constrain the judge LLM with a strict, atomic rubric and few-shot examples. Run a smaller, cheaper model in JSON mode so that it returns a boolean or enum verdict, instead of asking a frontier model for an open-ended critique.
Journey Context:
LLM judges suffer from verbosity bias and position bias. Giving a judge a vague prompt like 'is this a good response?' yields noisy evals. Developers often over-engineer this by using the most expensive models. A highly constrained, programmatic rubric (e.g., 'Does the output contain the error code? true/false') whose verdict is parsed from JSON pushes the judge toward near-deterministic behavior and reduces eval cost and latency.
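A minimal sketch of the constrained-judge pattern described above, not a reference implementation. Every name here (RUBRIC, FEW_SHOT, build_prompt, parse_verdict, judge) and the error code E1042 are hypothetical, chosen for illustration. The call_model parameter is an assumption: it stands in for whatever client sends a prompt to the judge model in JSON mode and returns the raw response string.

```python
import json
from typing import Callable

# Hypothetical atomic rubric: exactly one yes/no question per check.
RUBRIC = (
    "Does the output contain the error code {code}? "
    'Reply with JSON only: {{"verdict": true}} or {{"verdict": false}}.'
)

# Hypothetical few-shot examples: (agent output, error code, expected verdict).
FEW_SHOT = [
    ("Request failed with E1042 while connecting.", "E1042", True),
    ("Something went wrong, please retry.", "E1042", False),
]


def build_prompt(agent_output: str, code: str) -> str:
    """Assemble the rubric, the few-shot examples, and the output under test."""
    lines = []
    for example_output, example_code, verdict in FEW_SHOT:
        lines.append(RUBRIC.format(code=example_code))
        lines.append(f"Output: {example_output}")
        lines.append(json.dumps({"verdict": verdict}))
        lines.append("")
    lines.append(RUBRIC.format(code=code))
    lines.append(f"Output: {agent_output}")
    return "\n".join(lines)


def parse_verdict(raw: str) -> bool:
    """Accept only {"verdict": <bool>}; anything else raises ValueError."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or set(data) != {"verdict"}:
        raise ValueError(f"unexpected judge payload: {raw!r}")
    verdict = data["verdict"]
    if not isinstance(verdict, bool):
        raise ValueError(f"verdict must be a boolean, got: {verdict!r}")
    return verdict


def judge(agent_output: str, code: str, call_model: Callable[[str], str]) -> bool:
    """Run one atomic check. call_model is an assumed stand-in for the model client."""
    return parse_verdict(call_model(build_prompt(agent_output, code)))
```

Rejecting malformed payloads outright, rather than trying to salvage a verdict from free text, keeps judge failures visible so they can be counted or retried instead of silently skewing the scores.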
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:59:44.360579+00:00— report_created — created