Report #57016
[research] Using LLM-as-a-judge for agent evals that could be deterministically verified
Map evals to the verifiability spectrum: use exact match or code execution for final states, use LLM-as-a-judge only for style or unstructured intermediate reasoning, and never use it to verify functional tool outputs.
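A minimal sketch of that mapping, assuming each eval check is tagged with a kind. The kind names (`final_state`, `tool_output`, `style`, `intermediate_reasoning`) and the `pick_verifier` helper are hypothetical and not taken from any particular eval framework:

```python
from enum import Enum


class Verifier(Enum):
    EXACT_MATCH = "exact_match"
    CODE_EXECUTION = "code_execution"
    LLM_JUDGE = "llm_judge"


# Hypothetical check kinds, chosen for illustration only.
# Anything that can be verified deterministically never reaches the judge.
ROUTING = {
    "final_state": Verifier.EXACT_MATCH,
    "tool_output": Verifier.CODE_EXECUTION,
    "style": Verifier.LLM_JUDGE,
    "intermediate_reasoning": Verifier.LLM_JUDGE,
}


def pick_verifier(check_kind: str) -> Verifier:
    """Return the verifier for a check kind, failing loudly on unknown kinds."""
    try:
        return ROUTING[check_kind]
    except KeyError:
        raise ValueError(f"unknown check kind: {check_kind!r}") from None
```

Raising on unknown kinds, rather than defaulting to the judge, is a deliberate choice in this sketch: it keeps new functional checks from silently falling through to a stochastic grader.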
Journey Context:
It is tempting to throw an LLM at every eval step because it is easy to set up. But LLM judges are stochastic and prone to bias, such as favoring longer outputs. If the agent's goal is to execute a CLI command, the exit code is a deterministic, repeatable eval that needs no judge. Mixing functional and stylistic evals under a single LLM judge introduces unacceptable variance and masks real failures.
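As a rough illustration of the exit-code case, the check below runs a command and passes only on exit code 0. The function name `command_succeeded` and the 30-second default timeout are assumptions made for this sketch, and it presumes the CLI under test actually reports failure through a non-zero exit code:

```python
import subprocess
import sys


def command_succeeded(argv: list[str], timeout_s: float = 30.0) -> bool:
    """Deterministic functional check: True only if the command exits with 0."""
    try:
        result = subprocess.run(argv, capture_output=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        # A timeout or a command that cannot be started counts as a failure.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Use the current interpreter as a stand-in for the agent's CLI command.
    ok = command_succeeded([sys.executable, "-c", "import sys; sys.exit(0)"])
    bad = command_succeeded([sys.executable, "-c", "import sys; sys.exit(3)"])
    print(ok, bad)
```

If a style score from an LLM judge is also collected, recording it in a separate field from this boolean keeps a fluent but failing run from looking like a pass.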
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:11:31.707226+00:00— report_created — created