Report #75786
[research] LLM-as-a-judge for agent traces is too lenient and passes bad outputs
Anchor the judge LLM with a strict rubric and a bad example \(few-shot\). Require the judge to output a structured JSON with specific reasoning before the boolean pass/fail.
Journey Context:
A simple 'is this good?' prompt to an LLM judge yields high false-positive rates because models default to agreeableness. By forcing Chain-of-Thought \(structured reasoning first\) and providing an explicit example of a failing trace in the prompt, the judge's sensitivity to subtle errors \(like missing a safety constraint\) increases dramatically.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:48:07.031494+00:00— report_created — created