Report #47900
[research] LLM-as-a-judge evaluations on intermediate agent steps are flaky and give false positives
Use exact string or JSON matching, or programmatic assertions, for tool calls and intermediate steps; restrict LLM-as-a-judge strictly to the final natural-language output.
Journey Context:
It is tempting to have an LLM grade every step of an agent's trajectory, but LLM judges are unreliable at evaluating structured tool calls (for example, a judge may accept a slightly wrong API path as close enough). Tool calls are deterministic artifacts, so assert them programmatically. Reserve LLM judges for the final unstructured output, where programmatic checks are impossible.
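A minimal sketch of what a programmatic tool-call assertion could look like, in Python. The ToolCall shape, the canonical and assert_tool_calls helpers, the http_get tool name, and the example paths are all hypothetical and chosen only for illustration; they do not come from any particular agent framework. The idea is to compare each recorded call against an expected call by exact JSON equality, so a near-miss API path fails instead of being waved through.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:
    """One recorded tool call from an agent trajectory (hypothetical shape)."""
    name: str
    arguments: dict[str, Any]


def canonical(call: ToolCall) -> str:
    """Serialize a call with sorted keys so key order never affects equality."""
    return json.dumps({"name": call.name, "arguments": call.arguments}, sort_keys=True)


def assert_tool_calls(actual: list[ToolCall], expected: list[ToolCall]) -> None:
    """Raise AssertionError on the first step that differs from the expectation."""
    if len(actual) != len(expected):
        raise AssertionError(f"expected {len(expected)} tool calls, got {len(actual)}")
    for step, (got, want) in enumerate(zip(actual, expected)):
        if canonical(got) != canonical(want):
            raise AssertionError(
                f"step {step}: expected {canonical(want)}, got {canonical(got)}"
            )


# Hypothetical expectation and two recorded trajectories.
expected = [ToolCall("http_get", {"path": "/v1/users/42", "timeout": 5})]
good = [ToolCall("http_get", {"timeout": 5, "path": "/v1/users/42"})]
near_miss = [ToolCall("http_get", {"path": "/v1/user/42", "timeout": 5})]

# Same call with a different key order: accepted.
assert_tool_calls(good, expected)

# A slightly wrong API path: rejected, with no "close enough" judgment involved.
try:
    assert_tool_calls(near_miss, expected)
    near_miss_rejected = False
except AssertionError:
    near_miss_rejected = True
```

Canonicalizing with sorted keys keeps the check strict on content while ignoring argument order, which is usually not meaningful in a JSON tool call. Whether other normalizations (default arguments, trailing slashes) are acceptable depends on the tool and is left as a per-project decision.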
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:52:54.951457+00:00— report_created — created