Report #50459
[research] LLM-as-a-judge for agent trajectories is unreliable and gives false passes
Use LLM-as-a-judge only for subjective intermediate steps, and enforce strict programmatic assertions (assert, regex, schema validation) on tool inputs and outputs. Have the judge return a structured JSON verdict rather than free text.
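A minimal sketch of the programmatic side, assuming tool calls are recorded as plain dicts of arguments. The tool schema, its field names, and the helper `check_tool_call` are hypothetical and only illustrate the kind of type and regex checks meant here; a schema-validation library could fill the same role.

```python
import re

# Hypothetical schema for one tool: field name -> (expected type, optional regex).
SEARCH_TOOL_SCHEMA = {
    "query": (str, re.compile(r"\S")),  # must contain a non-whitespace character
    "max_results": (int, None),
}


def check_tool_call(args: dict, schema: dict) -> list[str]:
    """Return a list of violations; an empty list means the call passes."""
    errors = []
    for field, (expected_type, pattern) in schema.items():
        if field not in args:
            errors.append(f"missing field: {field}")
            continue
        value = args[field]
        # Exact type match, so that True is not accepted where an int is expected.
        if type(value) is not expected_type:
            errors.append(f"wrong type for {field}: {type(value).__name__}")
        elif pattern is not None and not pattern.search(value):
            errors.append(f"bad format for {field}: {value!r}")
    # Reject arguments the schema does not know about.
    for field in sorted(args.keys() - schema.keys()):
        errors.append(f"unexpected field: {field}")
    return errors
```

Returning a list of violations instead of raising on the first one lets a test report every problem in a trace at once; a caller can still write `assert not check_tool_call(args, SEARCH_TOOL_SCHEMA)`.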
Journey Context:
Developers use LLMs to grade agent traces because intermediate steps lack exact-match ground truth. However, LLM judges are prone to lazy grading (rubber-stamping a trace as passing) and verbosity bias (favoring longer outputs). The fix is a hybrid approach: programmatic checks for anything machine-readable, with LLM judges restricted to evaluating reasoning quality and required to output a structured score that can be parsed programmatically.
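The hybrid approach could be wired together roughly as follows. This is a sketch under stated assumptions: the verdict format (`verdict`, `score` from 1 to 5, `reason`), the `min_score` threshold, and both function names are hypothetical, and the judge call itself is left out. The point is that the judge's reply is parsed strictly and anything malformed counts as a fail, so a rubber-stamped free-text answer cannot produce a false pass.

```python
import json

ALLOWED_VERDICTS = {"pass", "fail"}  # hypothetical verdict vocabulary


def parse_judge_verdict(raw: str) -> dict:
    """Parse the judge's reply; anything malformed is treated as a fail."""
    def failed(reason: str) -> dict:
        return {"verdict": "fail", "score": 0, "reason": reason}

    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return failed("judge output was not valid JSON")
    if not isinstance(data, dict):
        return failed("judge output was not a JSON object")

    verdict = data.get("verdict")
    score = data.get("score")
    if not isinstance(verdict, str) or verdict not in ALLOWED_VERDICTS:
        return failed("missing or unknown verdict")
    # Assumed scale: integer score from 1 to 5 (bool is excluded by the exact type check).
    if type(score) is not int or not 1 <= score <= 5:
        return failed("missing or out-of-range score")
    return {"verdict": verdict, "score": score, "reason": str(data.get("reason", ""))}


def trajectory_passes(programmatic_errors: list[str], judge_raw: str, min_score: int = 4) -> bool:
    """Programmatic checks gate first; the judge only scores reasoning quality."""
    if programmatic_errors:
        return False
    verdict = parse_judge_verdict(judge_raw)
    return verdict["verdict"] == "pass" and verdict["score"] >= min_score
```

For example, `trajectory_passes([], "Looks good to me!")` returns `False` because the reply is not JSON, and any non-empty list of programmatic violations fails the trace regardless of what the judge says.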
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:10:39.822469+00:00— report_created — created