Report #86811
[research] LLM-as-a-judge for agent traces is unreliable and gives false passes
Use a structured rubric with atomic boolean checks (e.g., 'Did it call the get_user tool?', 'Did it pass the correct ID?') rather than asking 'Is this a good trace?'. Combine with exact-match assertions on tool arguments.
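A minimal sketch of such a rubric, assuming a trace is recorded as an ordered list of tool calls with `tool` and `args` keys. That trace shape, the `Check` helper, the `search_docs` tool, and the sample ID `u_123` are all hypothetical and only illustrate the idea:

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical trace shape: one dict per tool call, in call order.
Trace = list[dict[str, Any]]


@dataclass(frozen=True)
class Check:
    """One atomic, boolean rubric item."""
    name: str
    passed: Callable[[Trace], bool]


def called_tool(tool: str) -> Callable[[Trace], bool]:
    # True if the tool was called at least once.
    return lambda trace: any(step["tool"] == tool for step in trace)


def called_with(tool: str, args: dict[str, Any]) -> Callable[[Trace], bool]:
    # Exact-match assertion: the arguments must equal the expected dict.
    return lambda trace: any(
        step["tool"] == tool and step["args"] == args for step in trace
    )


# Illustrative rubric; the expected ID is made up for this example.
RUBRIC = [
    Check("called get_user", called_tool("get_user")),
    Check("passed the correct ID", called_with("get_user", {"user_id": "u_123"})),
]


def grade(trace: Trace, rubric: list[Check]) -> dict[str, bool]:
    # One boolean per check instead of a single holistic verdict.
    return {check.name: check.passed(trace) for check in rubric}


example_trace: Trace = [
    {"tool": "search_docs", "args": {"query": "refund policy"}},
    {"tool": "get_user", "args": {"user_id": "u_999"}},  # wrong ID
]

results = grade(example_trace, RUBRIC)
print(results)
# {'called get_user': True, 'passed the correct ID': False}
```

Each failed item names a specific defect, so a trace that calls the right tool with the wrong ID cannot slip through as a pass.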
Journey Context:
Asking an LLM to grade an entire agent trajectory holistically leads to lenient, high-variance verdicts, because the judge loses track of individual steps in a long context. Breaking the trace down into step-by-step deterministic checks (or narrowly scoped LLM-graded steps) improves inter-rater reliability and catches the exact step where the agent diverged from the golden path.
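Locating the divergence step can be sketched as a plain comparison against a golden path. This assumes both the trace and the golden path are ordered lists of tool calls that can be compared for equality. The `first_divergence` helper, the step shape, and the `get_orders` tool are hypothetical, and the sketch covers only the deterministic case, not LLM-graded steps:

```python
from typing import Any, Optional

# Hypothetical step shape: {"tool": <name>, "args": {<argument dict>}}.
Step = dict[str, Any]


def first_divergence(trace: list[Step], golden: list[Step]) -> Optional[int]:
    """Return the index of the first step that differs from the golden path.

    Returns None when the trace matches the golden path exactly.
    """
    for i, expected in enumerate(golden):
        # A missing step or a mismatched step both count as divergence.
        if i >= len(trace) or trace[i] != expected:
            return i
    # Extra steps after the golden path ended also count as divergence.
    if len(trace) > len(golden):
        return len(golden)
    return None


golden = [
    {"tool": "get_user", "args": {"user_id": "u_123"}},
    {"tool": "get_orders", "args": {"user_id": "u_123"}},
]
trace = [
    {"tool": "get_user", "args": {"user_id": "u_123"}},
    {"tool": "get_orders", "args": {"user_id": "u_999"}},  # wrong ID
]

diverged_at = first_divergence(trace, golden)
print(diverged_at)
# 1
```

Strict equality is a deliberate simplification: an agent that may legitimately reorder calls or take alternative routes would need a looser comparison per step.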
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:18:12.964990+00:00— report_created — created