Report #9768
[research] LLM-as-a-judge evaluator is biased and unreliable for agent trajectories
Calibrate LLM judges using a golden dataset of edge cases (false positives, false negatives) and enforce a structured rubric (e.g., 5-point scale with strict definitions) rather than open-ended grading.
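A minimal sketch of what a structured rubric could look like in practice, assuming the judge is asked to reply in JSON. The rubric wording, `build_judge_prompt`, and `parse_judge_reply` are hypothetical names invented for illustration; the call to the judge model itself is deliberately left out.

```python
import json

# Hypothetical 5-point rubric; the level definitions are illustrative only.
RUBRIC = {
    1: "Task not attempted, or the trajectory ends in an error state.",
    2: "Task attempted, but the final result is wrong.",
    3: "Result partially correct; at least one required step is missing.",
    4: "Result correct, but with unnecessary or redundant steps.",
    5: "Result correct and every step is necessary.",
}


def build_judge_prompt(task: str, trajectory: str) -> str:
    """Build a prompt that forces the judge onto the fixed scale."""
    levels = "\n".join(f"{score}: {text}" for score, text in sorted(RUBRIC.items()))
    return (
        "Grade the agent trajectory against the rubric below.\n"
        "Ignore length and style; grade only against the level definitions.\n\n"
        f"Rubric:\n{levels}\n\n"
        f"Task:\n{task}\n\n"
        f"Trajectory:\n{trajectory}\n\n"
        'Reply with JSON only: {"score": <integer 1-5>, '
        '"evidence": "<quote from the trajectory>"}'
    )


def parse_judge_reply(reply: str) -> int:
    """Return the score, rejecting anything that is not a rubric level."""
    data = json.loads(reply)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("judge reply is not a JSON object")
    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or score not in RUBRIC:
        raise ValueError(f"score outside rubric: {score!r}")
    return score
```

Rejecting out-of-range or non-integer scores, rather than coercing them, keeps malformed judge replies visible instead of silently folding them into the results.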
Journey Context:
Using GPT-4 to grade your agent outputs seems easy, but LLM judges suffer from position bias (favoring an answer because of where it appears in the prompt), verbosity bias (favoring longer answers), and self-preference (favoring outputs that resemble the judge model's own). If you just ask 'is this good?', the judge is highly unreliable. Constrain the judge with a strict rubric, and continuously validate the judge itself against a fixed set of manually graded examples to detect judge drift.
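One way to validate the judge against manually graded examples is sketched below, assuming each verdict has already been reduced to pass/fail (for instance by thresholding a rubric score). `GoldenExample`, `JudgeReport`, `evaluate_judge`, `has_drifted`, and the 0.05 tolerance are all hypothetical choices for illustration, not part of any library.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class GoldenExample:
    example_id: str
    human_pass: bool  # manual label from the golden dataset
    judge_pass: bool  # judge's verdict on the same example


@dataclass(frozen=True)
class JudgeReport:
    agreement: float
    false_positive_rate: float
    false_negative_rate: float


def evaluate_judge(examples: Sequence[GoldenExample]) -> JudgeReport:
    """Compare judge verdicts with human labels on the golden set."""
    if not examples:
        raise ValueError("golden set is empty")
    # False positive: judge passes something the human graders failed.
    fp = sum(1 for e in examples if e.judge_pass and not e.human_pass)
    # False negative: judge fails something the human graders passed.
    fn = sum(1 for e in examples if not e.judge_pass and e.human_pass)
    negatives = sum(1 for e in examples if not e.human_pass)
    positives = len(examples) - negatives
    return JudgeReport(
        agreement=(len(examples) - fp - fn) / len(examples),
        false_positive_rate=fp / negatives if negatives else 0.0,
        false_negative_rate=fn / positives if positives else 0.0,
    )


def has_drifted(baseline: JudgeReport, current: JudgeReport,
                tolerance: float = 0.05) -> bool:
    """Flag drift when agreement drops by more than the assumed tolerance."""
    return baseline.agreement - current.agreement > tolerance
```

Re-running `evaluate_judge` on the same fixed golden set whenever the judge model or prompt changes, and comparing against a stored baseline with `has_drifted`, gives a simple regression signal. Tracking false positives and false negatives separately shows which direction the judge is failing in.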
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T09:06:31.099090+00:00— report_created — created