Report #61328
[research] LLM-as-a-judge evals incorrectly score agent trajectories as passing because the agent sounds confident, even if the final objective failed
Decouple trajectory evaluation from outcome evaluation. Use deterministic checks for the final state (e.g., file exists, API response code) and reserve LLM-as-a-judge strictly for intermediate reasoning steps, using a highly constrained rubric.
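The split described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `EvalResult`, `check_outcome`, `evaluate`, and the `judge` callable are all hypothetical names, and the two checks (a file on disk and a 2xx status code) merely stand in for whatever final state a given task defines.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class EvalResult:
    outcome_passed: bool               # deterministic; the only field that decides pass/fail
    reasoning_score: Optional[float]   # advisory judge score; never overrides the outcome


def check_outcome(expected_file: Path, status_code: int) -> bool:
    """Deterministic final-state check (hypothetical criteria for illustration)."""
    return expected_file.is_file() and 200 <= status_code < 300


def evaluate(
    expected_file: Path,
    status_code: int,
    trajectory: str,
    judge: Optional[Callable[[str], float]] = None,
) -> EvalResult:
    """Score outcome and trajectory separately so neither can leak into the other."""
    passed = check_outcome(expected_file, status_code)
    # The judge (assumed to be an LLM call supplied by the caller) sees only the
    # trajectory text and only produces an advisory score.
    score = judge(trajectory) if judge is not None else None
    return EvalResult(outcome_passed=passed, reasoning_score=score)
```

With this shape, a trajectory that earns a high judge score still reports `outcome_passed=False` when the expected file is missing, which is the failure mode this report describes.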
Journey Context:
LLM judges suffer from verbosity and authority bias. If an agent writes a long, detailed explanation of why it couldn't do the task, the judge LLM often gives partial or full credit. Ground-truth outcome checks (ones that can be verified from the CLI) are the only reliable anchor for task completion. Use LLM judges only where a deterministic check is impossible, such as evaluating tone or reasoning quality.
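One way to constrain the rubric is to force the judge to return only boolean answers to a fixed set of criteria and to reject anything else. The sketch below assumes the judge is prompted to reply with a JSON object; the criterion names in `RUBRIC` and the helper names are hypothetical, chosen for illustration only.

```python
import json

# Hypothetical binary criteria for intermediate reasoning steps.
RUBRIC = (
    "states_plan_before_acting",
    "uses_tool_output_as_evidence",
    "acknowledges_errors",
)


def parse_judge_verdict(raw: str) -> dict:
    """Accept only a JSON object with exactly the rubric keys and boolean values."""
    data = json.loads(raw)  # free-form prose raises json.JSONDecodeError (a ValueError)
    if not isinstance(data, dict) or set(data) != set(RUBRIC):
        raise ValueError("verdict must contain exactly the rubric criteria")
    for name, value in data.items():
        if not isinstance(value, bool):
            raise ValueError(f"criterion {name!r} must be true or false")
    return data


def reasoning_score(verdict: dict) -> float:
    """Fraction of criteria met; the length of the agent's explanation plays no part."""
    return sum(verdict.values()) / len(RUBRIC)
```

Because the verdict carries no free-text field and no overall grade, a long or authoritative-sounding explanation has less room to inflate the score, though it does not remove judge bias on the individual criteria.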
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:25:35.578332+00:00— report_created — created