Report #61112
[research] Using LLM-as-a-judge for agent trajectory evals yields inconsistent scores and misses subtle logical errors
Constrain the judge LLM to a strict rubric and grade in multiple steps. Instead of asking 'Is this trajectory good?', ask a sequence of narrow questions: 1. 'Did the agent use tool X?' 2. 'Did the tool output contain Y?' 3. 'Based on 1 and 2, is the step valid?' The constrained steps can run on smaller, faster models.
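The three-question rubric can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `Judge` callable, the `Step` record, and the names `ask_yes_no` and `grade_step` are all hypothetical, and the scripted judge at the end stands in for a real model call so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical judge interface (an assumption): prompt in, raw model text out.
Judge = Callable[[str], str]


@dataclass
class Step:
    tool: str    # name of the tool the agent called
    output: str  # raw output returned by that tool


def ask_yes_no(judge: Judge, question: str) -> bool:
    """Ask one binary question; reject anything that is not YES or NO."""
    answer = judge(question + "\nAnswer with exactly YES or NO.").strip().upper()
    if answer not in ("YES", "NO"):
        raise ValueError(f"Unparseable judge answer: {answer!r}")
    return answer == "YES"


def grade_step(judge: Judge, step: Step, tool_x: str, content_y: str) -> Dict[str, bool]:
    """Grade one step with three narrow questions instead of one open one."""
    used_tool = ask_yes_no(
        judge, f"The agent called tool '{step.tool}'. Did the agent use tool '{tool_x}'?"
    )
    has_content = ask_yes_no(
        judge, f"Tool output:\n{step.output}\nDid the tool output contain '{content_y}'?"
    )
    # The final question sees only the two earlier verdicts, not the raw trajectory.
    valid = ask_yes_no(
        judge,
        f"1. Agent used tool '{tool_x}': {used_tool}\n"
        f"2. Tool output contained '{content_y}': {has_content}\n"
        "Based on 1 and 2, is the step valid?",
    )
    return {"used_tool": used_tool, "has_content": has_content, "valid": valid}


# Scripted stand-in for a real model call, so the sketch runs offline.
scripted = iter(["YES", "no ", "NO"])
verdicts = grade_step(
    lambda prompt: next(scripted), Step("search", "no results"), "search", "order_id"
)
print(verdicts)
```

Returning the per-question verdicts, rather than only the final one, keeps the intermediate answers available for inspection when a grade looks wrong.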
Journey Context:
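Unstructured LLM judging is sensitive to prompt phrasing, so the same trajectory can receive different scores from one run to the next. Decomposing the evaluation into verifiable, binary sub-questions leaves the judge less room to interpret the task differently each time, which improves reliability and reduces variance. It also makes a failing eval easier to debug, because the failure can be traced to a specific sub-question.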
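One way to check the variance claim on your own data is to repeat the judge on the same step and measure, per sub-question, how often the runs agree with the majority verdict. The sketch below assumes each judge run yields a dict mapping sub-question names to booleans; the function name `agreement_by_question` and the sample data are illustrative, not measured results.

```python
from collections import Counter
from typing import Dict, List


def agreement_by_question(runs: List[Dict[str, bool]]) -> Dict[str, float]:
    """Per sub-question, the fraction of runs matching the majority verdict."""
    if not runs:
        raise ValueError("need at least one judge run")
    agreement = {}
    for question in runs[0]:
        counts = Counter(run[question] for run in runs)
        majority_count = counts.most_common(1)[0][1]
        agreement[question] = majority_count / len(runs)
    return agreement


# Illustrative data only: four repeated gradings of the same step.
sample_runs = [
    {"used_tool": True, "valid": True},
    {"used_tool": True, "valid": False},
    {"used_tool": True, "valid": True},
    {"used_tool": True, "valid": False},
]
scores = agreement_by_question(sample_runs)
print(scores)
```

A sub-question with low agreement is the one to rewrite or tighten first; with a single open-ended question there is no comparable way to locate the instability.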
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:03:47.034531+00:00— report_created — created