Report #59342
[synthesis] Why do our AI eval scores improve while user satisfaction declines?
Build a multi-layered evaluation stack: (1) automated benchmarks for regression detection, (2) LLM-as-judge on real user query distributions, (3) human evaluation on a stratified sample of production traffic, and (4) user satisfaction metrics with qualitative feedback. Never ship based on improvement in a single evaluation layer. Track the correlation between layers and investigate when they diverge.
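One way to track agreement between layers is sketched below, assuming each layer produces a single score per release. The layer names, the scores, the `threshold` default, and the helpers `pearson`, `diverging_pairs`, and `safe_to_ship` are all hypothetical and only illustrate the idea; a real pipeline would likely need more releases and per-category scores before the correlations mean much.

```python
from math import sqrt

# Hypothetical per-release scores for each evaluation layer, oldest first.
layer_scores = {
    "benchmark": [0.71, 0.74, 0.78, 0.81, 0.85],
    "llm_judge": [0.66, 0.68, 0.70, 0.71, 0.73],
    "human_eval": [0.64, 0.65, 0.63, 0.61, 0.58],
    "user_satisfaction": [0.70, 0.69, 0.66, 0.62, 0.57],
}


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (assumes non-zero variance)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def diverging_pairs(scores, threshold=0.0):
    """Return (layer_a, layer_b, r) for every pair whose correlation is below threshold."""
    names = list(scores)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(scores[a], scores[b])
            if r < threshold:
                flagged.append((a, b, round(r, 2)))
    return flagged


def safe_to_ship(scores):
    """Ship gate: every layer must have improved in the latest release, not just one."""
    return all(series[-1] > series[-2] for series in scores.values())


for a, b, r in diverging_pairs(layer_scores):
    print(f"investigate: {a} vs {b} (r={r})")
print("safe to ship:", safe_to_ship(layer_scores))
```

With this made-up data, the automated layers trend upward while the human-facing layers trend downward, so the automated/human pairs are flagged and the ship gate fails even though the benchmark score improved.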
Journey Context:
Traditional software has a tight feedback loop: tests pass → code works. AI products have a broken feedback loop: eval scores improve → user experience may degrade. This happens because: (a) benchmarks are narrow and gameable (Goodhart's Law), (b) improvements on benchmark distributions don't transfer to production distributions, (c) aggregate metrics hide per-category regressions, and (d) LLM evaluators have their own biases, so their scores correlate poorly with human judgment on edge cases. Teams optimize for eval scores, celebrate improvements, and are blindsided when user complaints increase. The relationship can even invert: the model gets 'better' on benchmarks but worse for actual users. This is Goodhart's Law applied to AI evaluation, and it is more dangerous than the software version because the evaluation gap stays invisible until users complain.
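Causes (b) and (c) can be illustrated with a small worked example. The category names, pass counts, production traffic mix, and the `tolerance` default below are invented for illustration; the sketch only shows how an aggregate benchmark score can rise while one category regresses and the production-weighted score falls.

```python
# Hypothetical (passed, total) counts per category on the benchmark set.
baseline = {"coding": (80, 100), "summarization": (70, 100), "billing_support": (45, 50)}
candidate = {"coding": (95, 100), "summarization": (78, 100), "billing_support": (35, 50)}

# Hypothetical share of real production traffic per category (sums to 1.0).
production_mix = {"coding": 0.2, "summarization": 0.2, "billing_support": 0.6}


def aggregate_accuracy(results):
    """Benchmark-style aggregate: pooled pass rate over all examples."""
    passed = sum(p for p, _ in results.values())
    total = sum(t for _, t in results.values())
    return passed / total


def production_weighted_accuracy(results, mix):
    """Per-category pass rate reweighted by the production traffic mix."""
    return sum(mix[cat] * p / t for cat, (p, t) in results.items())


def regressions(old, new, tolerance=0.02):
    """Categories whose pass rate dropped by more than `tolerance`."""
    dropped = {}
    for cat, (old_p, old_t) in old.items():
        new_p, new_t = new[cat]
        old_acc, new_acc = old_p / old_t, new_p / new_t
        if new_acc < old_acc - tolerance:
            dropped[cat] = (old_acc, new_acc)
    return dropped


print("aggregate:", aggregate_accuracy(baseline), "->", aggregate_accuracy(candidate))
print("production-weighted:",
      round(production_weighted_accuracy(baseline, production_mix), 3), "->",
      round(production_weighted_accuracy(candidate, production_mix), 3))
print("regressed categories:", regressions(baseline, candidate))
```

In this invented data the pooled benchmark score improves, but the category that dominates production traffic regresses, so the production-weighted score drops. A per-category check like `regressions` is one way to surface that before users do.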
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:06:03.701364+00:00— report_created — created