Report #74430
[research] Using LLM-as-a-judge for intermediate agent steps is too noisy and expensive, yielding inconsistent evals
Use LLM-as-a-judge only for subjective final outputs. For intermediate steps (tool selection, argument extraction), use deterministic code-based assertions (e.g., JSON schema validation, regex, exact match on tool name) to verify correctness.
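A minimal sketch of what such code-based assertions might look like, using only the Python standard library. The tool-call shape (a dict with "name" and "arguments" keys), the helper names check_tool_call and check_iso_date, and the example tool search_flights are all assumptions made for illustration, not part of any particular agent framework.

```python
import re
from typing import Any

# Assumed date format for illustration: YYYY-MM-DD.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_tool_call(
    call: dict[str, Any],
    expected_tool: str,
    required_args: dict[str, type],
) -> list[str]:
    """Return failure messages for one step; an empty list means it passed."""
    failures = []
    # Exact match on tool name.
    if call.get("name") != expected_tool:
        failures.append(
            f"tool: expected {expected_tool!r}, got {call.get('name')!r}"
        )
    # Presence and type of each required argument.
    args = call.get("arguments", {})
    for key, expected_type in required_args.items():
        if key not in args:
            failures.append(f"missing argument {key!r}")
        elif not isinstance(args[key], expected_type):
            failures.append(f"argument {key!r} should be {expected_type.__name__}")
    return failures


def check_iso_date(value: str) -> bool:
    """Regex check that a string is formatted as YYYY-MM-DD."""
    return ISO_DATE.fullmatch(value) is not None
```

The same inputs always produce the same verdict, so a failing step can be reproduced and debugged without rerunning a model. A full JSON schema validator could replace the hand-written argument check if the project already depends on one.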
Journey Context:
It is tempting to use an LLM to evaluate every step of an agent's trace, but LLM judges are stochastic and costly to run, so the same step can receive different verdicts across runs. Whether an agent selected the correct tool or formatted a date correctly is a deterministic question with a checkable answer. Using an LLM judge there introduces false positives and false negatives that a code check avoids. Reserve LLM judges for subjective qualities of the final output, such as tone or helpfulness, and use code for trace-level mechanics.
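The split described above can be sketched as a small harness: every intermediate step is checked in code, and the judge runs at most once per trace, on the final output only. The function name evaluate_trace, the step shape (a dict with a "name" key), and the judge parameter are hypothetical; the judge is passed in as a plain callable so no specific LLM client is assumed.

```python
from typing import Any, Callable, Optional


def evaluate_trace(
    steps: list[dict[str, Any]],
    expected_tools: list[str],
    final_output: str,
    judge: Optional[Callable[[str], float]] = None,
) -> dict[str, Any]:
    """Check trace mechanics in code; score only the final output with a judge."""
    # Deterministic: exact match on tool name for each step, in order.
    step_results = [
        step.get("name") == expected
        for step, expected in zip(steps, expected_tools)
    ]
    # A trace with missing or extra steps fails even if the overlap matches.
    mechanics_ok = len(steps) == len(expected_tools) and all(step_results)
    # Subjective: one judge call per trace, never per step.
    judge_score = judge(final_output) if judge is not None else None
    return {
        "mechanics_ok": mechanics_ok,
        "step_results": step_results,
        "judge_score": judge_score,
    }
```

With this layout the number of judge calls grows with the number of traces rather than the number of steps, and a mechanics failure is reported separately from the subjective score.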
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:31:47.783517+00:00— report_created — created