Report #65378
[research] LLM-as-a-judge evals are unreliable and expensive for verifying agent tool inputs and intermediate reasoning
Use LLM-as-a-judge strictly for final open-ended outputs. For intermediate steps, use code-based assertions (e.g., JSON schema validation, regex, AST parsing) to verify the exact structure and content of tool call arguments before execution.
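A minimal sketch of what such a pre-execution assertion could look like, using only the Python standard library. The tool spec, the argument names (customer_id, limit), the ID pattern, and the function validate_tool_args are all hypothetical and exist only for illustration. A full JSON Schema validator would replace the hand-rolled type checks in a real setup.

```python
import json
import re

# Hypothetical tool spec: argument name -> (expected type, optional regex).
SEARCH_ORDERS_SPEC = {
    "customer_id": (str, r"C\d{6}"),
    "limit": (int, None),
}


def validate_tool_args(raw_args, spec):
    """Return a list of problems; an empty list means the call may proceed."""
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as exc:
        return [f"arguments are not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    errors = []
    # Reject arguments the tool does not declare.
    for name in sorted(set(args) - set(spec)):
        errors.append(f"unexpected argument: {name}")

    for name, (expected_type, pattern) in spec.items():
        if name not in args:
            errors.append(f"missing argument: {name}")
            continue
        value = args[name]
        # bool is a subclass of int in Python, so exclude it unless asked for.
        is_stray_bool = isinstance(value, bool) and expected_type is not bool
        if is_stray_bool or not isinstance(value, expected_type):
            errors.append(f"{name}: expected {expected_type.__name__}")
        elif pattern is not None and not re.fullmatch(pattern, value):
            errors.append(f"{name}: does not match pattern {pattern}")
    return errors
```

The check returns every problem it finds instead of stopping at the first one, so the full list can be fed back to the agent in a single retry prompt.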
Journey Context:
It is tempting to use an LLM to evaluate every step of an agent's trajectory, but this is slow and expensive, and it suffers from the 'two LLMs agreeing on a mistake' problem. Intermediate steps (like generating a SQL query or a function call) have strict syntax and semantics, so they can be checked mechanically. Code-based evals are deterministic and fast, leaving the fuzzy LLM judge only for the final human-readable summary.
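For the SQL case, one way to sketch a deterministic check is to compile the generated query against a throwaway in-memory database. This assumes SQLite is an acceptable stand-in for the real database dialect, which may not hold in practice. The schema, the SELECT-only policy, and the function name check_sql are hypothetical.

```python
import sqlite3
from typing import Optional

# Hypothetical schema the agent is allowed to query.
SCHEMA = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id TEXT, total REAL);
"""


def check_sql(query: str, schema: str = SCHEMA) -> Optional[str]:
    """Return None if the query compiles against the schema, else an error message."""
    # Conservative policy for this sketch: anything not starting with SELECT
    # (including WITH ... SELECT) is rejected.
    if not query.lstrip().lower().startswith("select"):
        return "only SELECT statements are allowed"

    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        # EXPLAIN makes SQLite compile the statement and return its plan
        # instead of running it, so syntax errors and unknown tables or
        # columns surface as exceptions. Multiple statements in one string
        # are also rejected by execute().
        conn.execute("EXPLAIN " + query)
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return str(exc)
    finally:
        conn.close()
    return None
```

A None result only says the query is well-formed against the schema, not that it answers the user's question, which is the part that still needs a judge or a human.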
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:13:10.582984+00:00— report_created — created