Report #93793
[research] Using LLM-as-a-judge for agent traces results in the blind leading the blind
Constrain the LLM judge to evaluate the agent's process against a strict rubric rather than judging the outcome for general correctness. Require the judge to output structured JSON in which each verdict references specific trace spans.
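One way to enforce the structured-output requirement is to validate the judge's JSON before trusting it. The sketch below is illustrative only: the field names (`verdicts`, `rubric_id`, `answer`, `span_ids`) and the function `validate_verdicts` are assumptions, not part of this report or of any specific library. It rejects verdicts that cite span ids absent from the trace, so a judge that invents evidence is caught mechanically.

```python
import json

ALLOWED_ANSWERS = {"yes", "no"}


def validate_verdicts(raw_output, trace_span_ids, rubric_ids):
    """Parse judge output and keep only verdicts grounded in the trace.

    Returns (accepted_verdicts, errors). The JSON shape is a hypothetical
    convention: {"verdicts": [{"rubric_id", "answer", "span_ids"}, ...]}.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [], [f"invalid JSON: {exc}"]
    if not isinstance(data, dict) or not isinstance(data.get("verdicts"), list):
        return [], ["expected an object with a 'verdicts' list"]

    accepted, errors = [], []
    for index, verdict in enumerate(data["verdicts"]):
        if not isinstance(verdict, dict):
            errors.append(f"verdict {index}: not an object")
            continue
        rubric_id = verdict.get("rubric_id")
        answer = verdict.get("answer")
        span_ids = verdict.get("span_ids")

        if not isinstance(rubric_id, str) or rubric_id not in rubric_ids:
            errors.append(f"verdict {index}: unknown rubric_id {rubric_id!r}")
            continue
        if answer not in ALLOWED_ANSWERS:
            errors.append(f"verdict {index}: answer must be 'yes' or 'no'")
            continue
        if not isinstance(span_ids, list) or not all(isinstance(s, str) for s in span_ids):
            errors.append(f"verdict {index}: span_ids must be a list of strings")
            continue
        # Assumption: a "yes" must cite at least one span as evidence.
        if answer == "yes" and not span_ids:
            errors.append(f"verdict {index}: 'yes' without any cited span")
            continue
        # Reject citations of spans that do not exist in the trace.
        missing = [s for s in span_ids if s not in trace_span_ids]
        if missing:
            errors.append(f"verdict {index}: cites unknown spans {missing}")
            continue
        accepted.append(verdict)
    return accepted, errors
```

A verdict that fails validation can be discarded or sent back to the judge for a retry; either way it never reaches the final score.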
Journey Context:
If an agent hallucinates a tool call, an unconstrained LLM judge may also hallucinate that the call was reasonable. Forcing the judge to act as a rubric grader (e.g., "Did the agent check the file system before writing? Yes/No") that answers only from the provided trace logs decouples the judge's reasoning from the agent's domain knowledge. The judge then behaves more like a deterministic state-machine verifier powered by an LLM than like a general oracle.
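A minimal sketch of how such a rubric-grader prompt might be assembled follows. Everything named here is hypothetical: the `RubricItem` and `Span` types, the `build_judge_prompt` function, the JSON field names, and the `list_dir` / `write_file` tool calls in the usage example. No model is called; the point is only that the judge sees fixed yes/no questions plus span ids, and is told not to assess the outcome.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    rubric_id: str
    question: str  # should be answerable with yes/no from the trace alone


@dataclass(frozen=True)
class Span:
    span_id: str
    kind: str  # e.g. "tool_call", "tool_result", "message"
    content: str


def build_judge_prompt(rubric, trace):
    """Render a rubric and a trace into a constrained grading prompt."""
    rubric_lines = [f"- {item.rubric_id}: {item.question}" for item in rubric]
    trace_lines = [f"[{span.span_id}] ({span.kind}) {span.content}" for span in trace]
    # Hypothetical output shape the judge is asked to follow.
    output_shape = {
        "verdicts": [
            {"rubric_id": "<id>", "answer": "yes|no", "span_ids": ["<span id>"]}
        ]
    }
    return "\n".join([
        "You are grading an agent trace against a fixed rubric.",
        "Use only the trace below. Do not judge whether the final outcome is correct.",
        "Answer each rubric question with yes or no and cite the supporting span ids.",
        "",
        "Rubric:",
        *rubric_lines,
        "",
        "Trace:",
        *trace_lines,
        "",
        "Respond with JSON only, in this shape:",
        json.dumps(output_shape),
    ])


# Usage with an illustrative two-span trace.
rubric = [
    RubricItem(
        "checked_fs_before_write",
        "Did the agent check the file system before writing?",
    )
]
trace = [
    Span("s1", "tool_call", "list_dir(path='/tmp/out')"),
    Span("s2", "tool_call", "write_file(path='/tmp/out/a.txt')"),
]
prompt = build_judge_prompt(rubric, trace)
```

Keeping the rubric as data rather than free text means the same questions are asked of every trace, which makes verdicts comparable across runs.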
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:01:11.656877+00:00— report_created — created