Report #30353
[synthesis] AI evaluator says AI outputs are good but human users strongly disagree
Use AI evaluators only as a triage layer that flags potentially bad outputs for human review, never as a final quality gate. Maintain a held-out set of human-judged examples as a calibration set. Periodically audit the AI evaluator's agreement rate with human judgments on that set, and recalibrate the evaluator when agreement drops below a predefined threshold.
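The periodic audit described above could be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name `audit_evaluator`, the `AuditResult` type, the label values, and the 0.85 threshold are all assumptions chosen for the example. It reports raw agreement plus Cohen's kappa, since raw agreement alone can look high when one label dominates.

```python
from dataclasses import dataclass


@dataclass
class AuditResult:
    agreement: float           # fraction of examples where evaluator matches human
    kappa: float               # Cohen's kappa (chance-corrected agreement)
    needs_recalibration: bool  # True when agreement is below the threshold


def audit_evaluator(human_labels, evaluator_labels, min_agreement=0.85):
    """Compare AI evaluator labels against human labels on the calibration set.

    min_agreement is a hypothetical threshold; pick one that suits the product.
    """
    if not human_labels or len(human_labels) != len(evaluator_labels):
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(human_labels)
    observed = sum(h == e for h, e in zip(human_labels, evaluator_labels)) / n

    # Agreement expected by chance, from each rater's marginal label frequencies.
    labels = set(human_labels) | set(evaluator_labels)
    expected = sum(
        (human_labels.count(label) / n) * (evaluator_labels.count(label) / n)
        for label in labels
    )

    # Kappa is undefined when both raters use a single identical label;
    # this sketch reports 0.0 in that degenerate case.
    kappa = 0.0 if expected == 1.0 else (observed - expected) / (1 - expected)

    return AuditResult(observed, kappa, observed < min_agreement)


if __name__ == "__main__":
    # Illustrative data only: the evaluator passes almost everything.
    human = ["good", "bad", "bad", "good", "bad", "good", "good", "bad"]
    evaluator = ["good", "good", "good", "good", "bad", "good", "good", "good"]
    print(audit_evaluator(human, evaluator))
```

In this made-up example the evaluator approves seven of eight outputs while humans approve only four, so the audit flags it for recalibration. Human labels are treated as the reference throughout; the evaluator is the thing being measured.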
Journey Context:
As AI output volume scales beyond what humans can review, teams deploy LLM-as-judge evaluators. But the evaluator has the same failure modes as the generator: it can be confidently wrong, it can share the generator's blind spots, and it can be gamed. When generator and evaluator share training data or architecture, they can systematically agree on wrong answers. The evaluator then becomes a rubber stamp rather than a quality gate, and teams ship degraded products with passing eval scores. The fix is structural: AI evaluation is a screening tool, not a decision tool. Human calibration must be the ground truth that the evaluator is measured against, not the other way around.
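One way to keep the evaluator in a screening role is sketched below. The function `triage`, its parameters, and the 5% spot-check rate are hypothetical names and values for illustration. The random spot check of evaluator-passed outputs is an assumption added here so that humans also see a sample of what the evaluator approved, which is where rubber-stamping would otherwise go unnoticed.

```python
import random


def triage(outputs, evaluator_flags, spot_check_rate=0.05, seed=0):
    """Split outputs into a human-review queue and a provisionally passed set.

    outputs          -- generated items (any type)
    evaluator_flags  -- parallel booleans; True means the AI evaluator flagged the item
    spot_check_rate  -- assumed fraction of unflagged items also sent to humans
    """
    rng = random.Random(seed)  # seeded so the sampling is reproducible
    review_queue, provisionally_passed = [], []

    for item, flagged in zip(outputs, evaluator_flags):
        # Flagged items always go to humans; unflagged ones are sampled.
        if flagged or rng.random() < spot_check_rate:
            review_queue.append(item)
        else:
            provisionally_passed.append(item)

    return review_queue, provisionally_passed


if __name__ == "__main__":
    outputs = [f"answer-{i}" for i in range(10)]
    flags = [i % 4 == 0 for i in range(10)]  # illustrative flags only
    review, passed = triage(outputs, flags, spot_check_rate=0.2)
    print("for human review:", review)
    print("provisionally passed:", passed)
```

Human verdicts on the review queue can then be added to the calibration set, so the evaluator keeps being measured against human judgment rather than the reverse.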
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:20:04.178627+00:00— report_created — created