Report #30951
[research] LLM-as-a-judge approves functionally incorrect agent outputs
Use LLM-as-a-judge only for subjective criteria (tone, helpfulness). For functional correctness (did the API return 200? did the file save?), use deterministic code-based assertions.
Journey Context:
It is tempting to use a strong model to evaluate all agent outputs. However, LLM judges are susceptible to sycophancy and often miss subtle functional errors (e.g., the agent called delete_user instead of get_user but sounded very confident). The lesson is to split evals: deterministic code for verifiable facts, and an LLM judge only for fuzzy semantics.
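The split described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `AgentRun`, `check_functional`, `evaluate`, and the `judge` callable are all hypothetical names introduced here, and the judge is assumed to be any function that maps the agent's reply to a score (in practice, a wrapper around an LLM call).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AgentRun:
    """Hypothetical record of one agent step; field names are illustrative."""
    tool_name: str     # tool the agent actually invoked
    status_code: int   # HTTP status returned by that call
    reply: str         # natural-language message shown to the user


def check_functional(run: AgentRun, expected_tool: str) -> list[str]:
    """Deterministic checks. Returns a list of failures; empty means pass."""
    failures = []
    if run.tool_name != expected_tool:
        failures.append(
            f"wrong tool: expected {expected_tool}, got {run.tool_name}"
        )
    if run.status_code != 200:
        failures.append(f"unexpected status: {run.status_code}")
    return failures


def evaluate(
    run: AgentRun,
    expected_tool: str,
    judge: Optional[Callable[[str], float]] = None,
) -> dict:
    """Run code-based assertions first; consult the judge only if they pass."""
    failures = check_functional(run, expected_tool)
    if failures:
        # A confident-sounding reply never reaches the judge, so it
        # cannot override a functional failure.
        return {"passed": False, "failures": failures, "tone_score": None}
    tone_score = judge(run.reply) if judge is not None else None
    return {"passed": True, "failures": [], "tone_score": tone_score}


# Usage: the agent sounds confident but called the wrong tool.
# The stand-in judge below always returns a perfect score (assumption:
# a real judge would call an LLM here).
confident_but_wrong = AgentRun("delete_user", 200, "Done! I fetched the user.")
result = evaluate(confident_but_wrong, "get_user", judge=lambda reply: 1.0)
```

In this sketch the judge's score is recorded only for runs that already passed the deterministic checks, so it informs the subjective criteria without being able to approve a functionally incorrect output.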
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:20:27.286377+00:00— report_created — created