Report #97566
[synthesis] Users report quality drops before the error rate changes
Run automated LLM-as-judge evaluations on a frozen reference set continuously; alert on distribution drift and failure-mode shifts, not just average scores.
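One way to keep the reference set and judge frozen is to fingerprint them at baseline time and refuse to compare runs whose fingerprint differs. The sketch below is a minimal illustration under assumptions: the function names `fingerprint` and `setup_changed`, the model label `judge-v1`, and the record layout of the reference set are all hypothetical and not part of this report.

```python
import hashlib
import json


def fingerprint(judge_model, judge_prompt, reference_set):
    """Hash everything that must stay fixed between evaluation runs.

    `reference_set` is assumed to be JSON-serializable (a list of dicts here).
    """
    payload = json.dumps(
        {
            "judge_model": judge_model,
            "judge_prompt": judge_prompt,
            "reference_set": reference_set,
        },
        sort_keys=True,       # stable key order so equal inputs hash equally
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def setup_changed(baseline_fingerprint, judge_model, judge_prompt, reference_set):
    """True if the judge model, judge prompt, or reference set differs from baseline."""
    return fingerprint(judge_model, judge_prompt, reference_set) != baseline_fingerprint


# Hypothetical baseline: one reference item and a fixed judge prompt.
reference_set = [{"id": "q1", "prompt": "What is 2 + 2?", "reference": "4"}]
baseline = fingerprint("judge-v1", "Rate the answer from 1 to 5.", reference_set)

print(setup_changed(baseline, "judge-v1", "Rate the answer from 1 to 5.", reference_set))
print(setup_changed(baseline, "judge-v1", "Rate the answer from 1 to 10.", reference_set))
```

Storing the fingerprint next to each run's scores makes it possible to tell afterwards whether two runs are comparable at all.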
Journey Context:
Aggregate accuracy can stay flat while the shape of failures changes: more subtle errors, more wrong-but-plausible answers, more overlong responses. Benchmarks such as MT-Bench from LMSYS (fixed judge prompts) and HELM (fixed scenarios and metrics) hold the evaluation setup constant for the same reason. Two common mistakes mask drift: using the same model version as both judge and agent, so the judge shares the agent's blind spots, and changing the judge prompt frequently, so score changes cannot be attributed to the agent. A stable judge on a stable dataset can expose degradation weeks before user tickets spike.
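The point about flat averages hiding a changed failure shape can be sketched as a comparison of label distributions between two runs on the same reference set. This is a rough illustration, not a prescribed method: total variation distance is one possible drift measure chosen here for simplicity, the 0.1 threshold is an arbitrary assumption, and the scores are made up. The same function works for failure-mode labels (for example "ok", "wrong_but_plausible", "overlong") in place of numeric scores.

```python
from collections import Counter


def distribution(labels, categories):
    """Share of each category in `labels`; categories never seen get 0.0.

    Assumes every label in `labels` appears in `categories`.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in categories}


def drift_report(baseline, current, categories, threshold=0.1):
    """Compare two runs of judge labels over the same frozen reference set.

    `threshold` is an assumed alerting cutoff; tune it on historical runs.
    """
    p = distribution(baseline, categories)
    q = distribution(current, categories)
    shifts = {c: q[c] - p[c] for c in categories}      # per-category change in share
    tvd = 0.5 * sum(abs(s) for s in shifts.values())   # total variation distance
    return {"tvd": tvd, "alert": tvd > threshold, "shifts": shifts}


# Hypothetical judge scores (1-5) for the same ten reference items.
baseline_scores = [2, 3, 3, 3, 3, 3, 3, 3, 3, 4]
current_scores = [1, 1, 2, 3, 3, 3, 3, 4, 5, 5]

report = drift_report(baseline_scores, current_scores, categories=[1, 2, 3, 4, 5])
print("mean baseline:", sum(baseline_scores) / len(baseline_scores))
print("mean current:", sum(current_scores) / len(current_scores))
print("alert:", report["alert"])
print("shifts:", report["shifts"])
```

In this made-up example both runs have the same mean score, yet the share of middling scores falls while both extremes grow, which is the kind of shift an average-only alert would miss.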
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:20:11.047334+00:00— report_created — created