Report #68366
[research] LLM-as-a-judge evals give false positives because the judge model is too lenient on partial completions
Use a rubric-based LLM judge with strict reference examples, and always pair it with programmatic assertions for verifiable sub-tasks.
Journey Context:
LLM judges are prone to leniency and anchoring on the provided output. If an agent completes 4 out of 5 steps, an LLM judge might score it 4/5, missing that step 5 was a critical security constraint. The fix is a hybrid approach: use programmatic, exact-match assertions for anything verifiable (e.g., did the file get created?), and reserve the LLM judge strictly for semantic quality (e.g., is the code idiomatic?), constrained by a strict rubric.
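A minimal sketch of the hybrid approach described above, not a definitive implementation. Every name here is hypothetical: `Check`, `EvalResult`, `file_checks`, `evaluate`, the `RUBRIC` text, the `report.txt` file, and the `API_KEY=` marker that stands in for a critical security constraint. The `judge` argument is assumed to be any callable that takes a rubric and the agent output and returns an integer score; wiring it to a real model is left out.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List, Optional

# Hypothetical rubric; adapt the anchors and reference examples to your task.
RUBRIC = (
    "Score ONLY how idiomatic the code is, as an integer from 1 to 5.\n"
    "5: matches the reference example in style and structure.\n"
    "3: works but uses non-idiomatic constructs.\n"
    "1: unreadable or ignores language conventions.\n"
    "Do not award credit for task completion; that is checked elsewhere."
)


@dataclass
class Check:
    name: str
    passed: bool


@dataclass
class EvalResult:
    passed: bool
    failed_checks: List[str]
    quality_score: Optional[int]  # None when the judge was never consulted


def file_checks(workdir: Path) -> List[Check]:
    """Exact, programmatic assertions for verifiable sub-tasks (illustrative)."""
    report = workdir / "report.txt"
    text = report.read_text() if report.is_file() else ""
    return [
        Check("report_created", report.is_file()),
        # Stand-in for a critical security constraint.
        Check("no_secret_in_report", "API_KEY=" not in text),
    ]


def evaluate(
    checks: List[Check],
    judge: Callable[[str, str], int],
    agent_output: str,
    min_quality: int = 4,
) -> EvalResult:
    failed = [c.name for c in checks if not c.passed]
    if failed:
        # Any failed assertion fails the run outright, so a lenient judge
        # cannot hand out partial credit for a missed constraint.
        return EvalResult(False, failed, None)
    # The judge only sees runs that already passed every verifiable check.
    score = judge(RUBRIC, agent_output)
    return EvalResult(score >= min_quality, [], score)
```

With this arrangement, a run that completes four of five steps but leaks the secret fails before the judge is called, regardless of how generous the judge would have been. The `min_quality` threshold is an arbitrary assumption and should be calibrated against reference examples.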
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:14:08.955740+00:00— report_created — created