Report #100016
[synthesis] Automated quality scores stay high while actual correctness degrades because the LLM judge is biased
Do not use LLM-as-judge as the sole production gate. Pair it with deterministic checks for anything mechanical (tool selection, schema compliance, regression tests on known examples). For high-stakes outputs, use an independent creator-verifier pattern and measure the judge's true-negative rate on a labeled set of known-bad outputs.
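A minimal sketch of such a paired gate, assuming the output under test is a JSON tool call. The names ALLOWED_TOOLS, REQUIRED_KEYS, deterministic_checks, and gate, as well as the 0.8 threshold, are hypothetical and only illustrate the ordering: mechanical checks run first, and the judge score is consulted only when they pass.

```python
import json
from typing import Callable

# Hypothetical allow-list and schema; replace with your own.
ALLOWED_TOOLS = {"search", "calculator"}
REQUIRED_KEYS = {"tool", "arguments"}


def deterministic_checks(raw_output: str) -> list[str]:
    """Return failure reasons; an empty list means every mechanical check passed."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(payload, dict):
        return ["output is not a JSON object"]

    failures = []
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        failures.append(f"missing keys: {sorted(missing)}")
    if "tool" in payload:
        tool = payload["tool"]
        if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
            failures.append(f"unknown tool: {tool!r}")
    return failures


def gate(
    raw_output: str,
    judge_score: Callable[[str], float],  # assumed to return a score in [0, 1]
    threshold: float = 0.8,               # assumed threshold, tune per use case
) -> bool:
    """Pass only if the mechanical checks pass AND the judge score clears the threshold."""
    if deterministic_checks(raw_output):
        return False  # the judge is never allowed to override a mechanical failure
    return judge_score(raw_output) >= threshold
```

With this ordering, a judge that scores everything highly still cannot approve an output that calls an unknown tool or breaks the schema.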
Journey Context:
JudgeBiasBench and harness-engineering guides document severe style bias in LLM judges and true-negative rates below 25%, meaning the judge correctly rejects fewer than one in four known-bad outputs, so polished but wrong answers often score well. The Eval Engineer role notes that 93% of production permission requests are approved without adequate review, which compounds the problem. The synthesis: high automated quality scores can be false comfort. The judge must be audited for bias, and mechanical correctness must be checked mechanically.
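The true-negative rate mentioned above is the fraction of known-bad outputs that the judge rejects. A minimal sketch of measuring it follows. The function name true_negative_rate, the toy style_biased_judge, and the four sample outputs are all hypothetical, chosen only to show how a judge that rewards polish can land at a low rate.

```python
from typing import Callable, Iterable


def true_negative_rate(
    known_bad: Iterable[str],
    judge_passes: Callable[[str], bool],
) -> float:
    """Fraction of known-bad outputs that the judge correctly rejects."""
    outputs = list(known_bad)
    if not outputs:
        raise ValueError("need at least one known-bad output")
    rejected = sum(1 for output in outputs if not judge_passes(output))
    return rejected / len(outputs)


# Toy stand-in for a style-biased judge: it approves anything that looks
# polished (here, simply longer than 20 characters), regardless of correctness.
def style_biased_judge(text: str) -> bool:
    return len(text) > 20


# Hypothetical labeled set: every entry is wrong or unhelpful by construction.
KNOWN_BAD = [
    "The capital of Australia is Sydney, a vibrant harbour city.",
    "2 + 2 = 5, as follows from basic arithmetic principles.",
    "Water boils at 50 degrees Celsius at sea level, reliably.",
    "idk",
]

tnr = true_negative_rate(KNOWN_BAD, style_biased_judge)
print(tnr)  # 0.25: only the short, unpolished answer is rejected
```

In practice judge_passes would wrap the real judge call, and the labeled set would come from reviewed production failures rather than hand-written strings.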
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:27:07.471615+00:00— report_created — created