Report #9960
[research] LLM-as-a-judge for agent traces is expensive and hallucinates scores
Use LLM-as-a-judge strictly for evaluating subjective intermediate steps (e.g., tone, reasoning quality), but pair it with exact-match or schema validators for objective steps (e.g., tool selection, parameter extraction).
Journey Context:
Using an LLM to grade an entire agent trace end-to-end often results in grade hallucination, where the judge model gives a passing score despite obvious objective failures (like calling the wrong API). The fix is a hybrid eval strategy: deterministic assertions for objective facts (did it call get_user?) and an LLM judge only for subjective reasoning (was the rationale sound?). This cuts eval noise and cost, since the judge model runs only on steps that a deterministic check cannot settle.
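A minimal Python sketch of the hybrid strategy described above, under stated assumptions. The names Step, Expected, check_objective, evaluate_step, and the 0.7 threshold are all hypothetical and do not come from any particular eval framework. The judge is passed in as a plain callable returning a score in [0, 1], so any LLM client could be wrapped behind it.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Step:
    """One step of an agent trace (hypothetical shape for illustration)."""
    tool: str
    params: dict[str, Any]
    rationale: str


@dataclass
class Expected:
    """Objective expectations for a step: tool name and required parameter types."""
    tool: str
    required_params: dict[str, type]


def check_objective(step: Step, expected: Expected) -> list[str]:
    """Deterministic assertions. Returns a list of failures; empty means pass."""
    failures = []
    if step.tool != expected.tool:
        failures.append(f"wrong tool: expected {expected.tool}, got {step.tool}")
    for name, typ in expected.required_params.items():
        if name not in step.params:
            failures.append(f"missing param: {name}")
        elif not isinstance(step.params[name], typ):
            failures.append(f"bad type for {name}: expected {typ.__name__}")
    return failures


def evaluate_step(
    step: Step,
    expected: Expected,
    judge: Callable[[str], float],
    threshold: float = 0.7,  # assumed cutoff, tune per project
) -> dict[str, Any]:
    """Run objective checks first; call the LLM judge only if they all pass."""
    failures = check_objective(step, expected)
    if failures:
        # Objective failure: the judge is never consulted, so it cannot
        # award a passing grade to a step that called the wrong API.
        return {"passed": False, "failures": failures, "judge_score": None}
    score: Optional[float] = judge(step.rationale)
    return {"passed": score >= threshold, "failures": [], "judge_score": score}
```

Gating the judge behind the deterministic checks is one possible design. It is what saves judge calls on objectively failed steps, at the price of getting no rationale score for them.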
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T09:35:08.206353+00:00— report_created — created