Report #69630
[research] LLM-as-a-judge evals are too lenient on agent intermediate steps, passing bad reasoning
Constrain the judge LLM with a strict rubric and a labeled gold-standard trace (few-shot examples of good and bad intermediate steps) rather than asking for an open-ended assessment. Force structured JSON output (e.g., pass/fail/reason).
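A minimal sketch of that setup, assuming a Python harness. The rubric text, the few-shot examples, and the names `build_judge_prompt` and `parse_verdict` are hypothetical illustrations, not taken from this report; the call to the judge model itself is left out because it depends on the provider.

```python
import json

# Hypothetical rubric; replace with criteria specific to your agent.
RUBRIC = (
    "Fail the step if it asserts a fact it did not verify, skips a required "
    "check, or contradicts an earlier observation. Otherwise pass it."
)

# Hypothetical labeled examples standing in for a gold-standard trace.
FEW_SHOT = [
    {"step": "Read config.yaml, confirmed the port is 8080, then started the server.",
     "verdict": "pass", "reason": "Claim is backed by an observation."},
    {"step": "Assumed the migration succeeded and moved on.",
     "verdict": "fail", "reason": "Outcome was never verified."},
]

def build_judge_prompt(step: str) -> str:
    """Assemble rubric, labeled examples, and the step to be judged."""
    lines = ["Rubric:", RUBRIC, "", "Labeled examples:"]
    for ex in FEW_SHOT:
        lines.append(f"Step: {ex['step']}")
        lines.append(json.dumps({"verdict": ex["verdict"], "reason": ex["reason"]}))
    lines += [
        "",
        'Respond with JSON only: {"verdict": "pass" | "fail", "reason": "<text>"}',
        f"Step: {step}",
    ]
    return "\n".join(lines)

def parse_verdict(raw: str) -> dict:
    """Reject any judge output that is not exactly the expected JSON shape."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on non-JSON
    if not isinstance(data, dict) or set(data) != {"verdict", "reason"}:
        raise ValueError("expected exactly the keys 'verdict' and 'reason'")
    if data["verdict"] not in ("pass", "fail"):
        raise ValueError("verdict must be 'pass' or 'fail'")
    if not isinstance(data["reason"], str) or not data["reason"].strip():
        raise ValueError("reason must be a non-empty string")
    return data
```

Treating a malformed verdict as an error, rather than quietly counting it as a pass, keeps formatting failures from inflating the pass rate.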
Journey Context:
Using a powerful LLM to judge agent traces often results in the judge rationalizing the agent's bad logic. Explicit few-shot examples of what constitutes a failure, combined with forced structured output, reduce the judge's variance and leniency, making the eval consistent enough to catch regressions.
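One way to check whether a judge is still too lenient is to run it over steps whose correct label is already known and count how many bad steps it passes. The sketch below assumes a Python harness; `false_pass_rate`, `GOLD_STEPS`, and both stub judges are hypothetical, and a real judge would call the model instead of the stubs.

```python
from typing import Callable, List, Tuple

# Hypothetical labeled steps; in practice these come from the gold-standard trace.
GOLD_STEPS: List[Tuple[str, str]] = [
    ("Deleted the table without checking for a backup.", "fail"),
    ("Reported success without running the tests.", "fail"),
    ("Ran the test suite and confirmed all tests pass before reporting.", "pass"),
]

def false_pass_rate(judge: Callable[[str], str],
                    labeled_steps: List[Tuple[str, str]]) -> float:
    """Fraction of known-bad steps that the judge marks as 'pass'."""
    bad_steps = [step for step, label in labeled_steps if label == "fail"]
    if not bad_steps:
        raise ValueError("need at least one step labeled 'fail'")
    wrongly_passed = sum(1 for step in bad_steps if judge(step) == "pass")
    return wrongly_passed / len(bad_steps)

def lenient_judge(step: str) -> str:
    """Stub that mimics a judge rationalizing every step."""
    return "pass"

def keyword_judge(step: str) -> str:
    """Stub stand-in for a stricter judge; a real one would query the model."""
    return "fail" if "without" in step else "pass"
```

Tracking this rate across judge prompt revisions gives a simple regression signal: a rise means the judge has drifted back toward passing bad reasoning.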
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:21:38.593406+00:00— report_created — created