Report #69753
[synthesis] Why does my AI pass all evals but users still report it's broken?
Supplement curated eval sets with 'production shadow evaluation': sample real user queries from production, evaluate model outputs on them, and track the distributional overlap between your eval set and real traffic. Maintain a living eval set that is continuously refreshed with production samples, weighted toward underrepresented query types. Report 'eval coverage' (how much of real traffic the eval set actually represents) as a metric alongside 'eval pass rate.'
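One way to make 'eval coverage' and 'distributional overlap' concrete is sketched below. The report does not define either metric, so the definitions here are assumptions: every query is assumed to already carry a query-type label (the labels and the functions `eval_coverage` and `distribution_overlap` are hypothetical), coverage is the share of production traffic whose type appears in the eval set, and overlap is the sum of per-type minimum frequencies (1 minus total variation distance).

```python
from collections import Counter


def distribution(labels):
    """Relative frequency of each query type in a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def eval_coverage(eval_labels, prod_labels):
    """Share of production queries whose type appears at least once in the eval set."""
    eval_types = set(eval_labels)
    covered = sum(1 for label in prod_labels if label in eval_types)
    return covered / len(prod_labels)


def distribution_overlap(eval_labels, prod_labels):
    """Sum of per-type minimum frequencies: 1.0 means identical mixes, 0.0 means disjoint."""
    p = distribution(eval_labels)
    q = distribution(prod_labels)
    return sum(min(p.get(t, 0.0), q.get(t, 0.0)) for t in set(p) | set(q))


# Illustrative labels only; real labels would come from a classifier or clustering step.
eval_labels = ["billing"] * 9 + ["login"] * 1
prod_labels = ["billing"] * 4 + ["login"] * 2 + ["refund"] * 3 + ["export"] * 1

print(f"coverage={eval_coverage(eval_labels, prod_labels):.2f}")
print(f"overlap={distribution_overlap(eval_labels, prod_labels):.2f}")
```

In this toy data the eval set never exercises the "refund" or "export" types, so a 100% pass rate would still leave 40% of traffic untested. Coverage and overlap can differ: coverage only asks whether a type is present, while overlap also penalizes a skewed mix.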
Journey Context:
Traditional software testing works because the input space can usually be partitioned into a bounded set of cases. AI evals sample from an effectively infinite input space, so any finite eval set is only a proxy for real usage. The gap between the eval distribution and the real-user distribution is where products fail. As usage scales, the long tail of user queries tends to grow faster than a static eval set, so the gap widens over time. Goodhart's Law compounds this: once eval pass rate becomes the target, teams optimize for the eval set rather than for real traffic, and the pass rate stops reflecting user experience. The common mistake is treating eval pass rate as sufficient without tracking eval coverage. Enumerating all possible inputs is infeasible. The right call is to treat evals as a sampling problem and to continuously measure and close the distribution gap. This synthesis connects Goodhart's Law in ML measurement with production traffic analysis and statistical coverage metrics.
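Closing the distribution gap can be sketched as a quota-based refresh: give each query type a share of new eval slots proportional to how under-represented it is relative to production, then draw that many production samples per type. This is a minimal sketch under assumptions, not a prescribed method; the names `refresh_quota` and `sample_refresh`, the `(query, label)` tuple shape, and the deficit formula are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def refresh_quota(eval_labels, prod_labels, budget):
    """Split `budget` new eval slots across query types by their frequency deficit."""
    p, q = Counter(eval_labels), Counter(prod_labels)
    n_eval, n_prod = len(eval_labels), len(prod_labels)
    # Deficit: production frequency minus eval frequency, floored at zero.
    deficit = {
        t: max(q[t] / n_prod - (p[t] / n_eval if n_eval else 0.0), 0.0) for t in q
    }
    total = sum(deficit.values())
    if total == 0:
        return {}
    # Rounding means the quotas may not sum exactly to `budget`.
    return {t: round(budget * d / total) for t, d in deficit.items() if d > 0}


def sample_refresh(prod_samples, eval_labels, budget, seed=0):
    """Pick production (query, label) pairs to add, favoring under-represented types."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for query, label in prod_samples:
        by_type[label].append(query)
    quota = refresh_quota(eval_labels, [label for _, label in prod_samples], budget)
    picked = []
    for label, k in quota.items():
        pool = by_type[label]
        # Sample without replacement; cap at the pool size for rare types.
        picked.extend((q, label) for q in rng.sample(pool, min(k, len(pool))))
    return picked


# Illustrative data only.
eval_labels = ["billing"] * 9 + ["login"] * 1
prod_samples = (
    [(f"billing query {i}", "billing") for i in range(4)]
    + [(f"login query {i}", "login") for i in range(2)]
    + [(f"refund query {i}", "refund") for i in range(3)]
    + [("export query 0", "export")]
)
for query, label in sample_refresh(prod_samples, eval_labels, budget=5):
    print(label, "|", query)
```

Types that are already over-represented in the eval set (here, "billing") receive no new slots, so repeated refreshes pull the eval mix toward the production mix. Sampled queries would still need reference answers or a grading rubric before they can serve as eval cases.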
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:34:01.092561+00:00— report_created — created