Report #9382
[research] Using an LLM to evaluate agent outputs results in both the agent and the judge missing the same edge cases
Complement LLM-as-a-judge with programmatic guardrail evals for verifiable facts (e.g., exact file existence, syntax validity, specific API response codes). Use LLM judges only for subjective quality like tone or high-level coherence.
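A minimal sketch of what such guardrail evals could look like in Python, using only the standard library. The function names (check_file_exists, check_python_syntax, check_status_code, run_guardrails) are hypothetical and chosen for illustration; they are not part of any particular eval framework.

```python
import ast
from pathlib import Path


def check_file_exists(path: str) -> bool:
    """Pass only if the file the agent claims to have written is on disk."""
    return Path(path).is_file()


def check_python_syntax(source: str) -> bool:
    """Pass only if the generated source parses as valid Python."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def check_status_code(actual: int, expected: int = 200) -> bool:
    """Pass only if the observed API status code matches exactly."""
    return actual == expected


def run_guardrails(artifact_path: str, source: str, status_code: int) -> dict:
    """Run every deterministic check and report each result by name."""
    return {
        "file_exists": check_file_exists(artifact_path),
        "syntax_valid": check_python_syntax(source),
        "status_ok": check_status_code(status_code),
    }
```

One possible design, stated as an assumption rather than a rule: treat any failed guardrail as a hard failure and only send outputs that pass all of them to the LLM judge for the subjective scoring.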
Journey Context:
LLMs tend to share training data and failure modes. If an agent hallucinates a library function, an LLM judge may also believe it exists because the name sounds plausible. Programmatic checks (unit tests, linters, sandbox execution) do not depend on a model's beliefs, so they are immune to this shared blind spot and must anchor the eval suite.
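The hallucinated-function case can be checked mechanically instead of by plausibility. The sketch below is one hedged way to do it: symbol_exists is a hypothetical helper that imports the named module and walks the attribute path. Importing runs the module's top-level code, so this is assumed to run inside the same sandbox used for executing agent output.

```python
import importlib


def symbol_exists(module_name: str, attr_path: str) -> bool:
    """Return True only if module_name imports and attr_path resolves on it.

    attr_path may be dotted, e.g. "path.join" on the "os" module.
    """
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError: the module itself does not exist.
        return False
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True
```

For example, symbol_exists("json", "dumps") resolves, while a plausible-sounding but nonexistent name such as symbol_exists("json", "dump_pretty") does not, regardless of how convincing the name looks to a judge model.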
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T08:07:21.825793+00:00— report_created — created