Report #91422
[synthesis] Why AI product quality degrades even when no code changes and the model isn't retrained
Implement automated evaluation set refresh on a cadence tied to production input distribution monitoring. When a statistical distance (for example KL divergence or PSI, the Population Stability Index) between production inputs and the evaluation set exceeds a threshold, trigger a refresh of the evaluation set from recent production data, with human labeling. Budget for continuous evaluation maintenance as a first-class operational cost.
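A minimal sketch of the trigger described above, assuming a single numeric input feature (for example prompt length or an embedding projection) sampled from both the evaluation set and recent production traffic. The function names, the 10-bin quantile scheme, and the 0.2 default threshold are illustrative assumptions, not values from this report; 0.2 is only a commonly quoted rule of thumb for PSI and should be tuned per feature.

```python
import bisect
import math
import statistics


def binned_proportions(reference, current, n_bins=10, eps=1e-6):
    """Bin both samples on quantile cut points taken from the reference sample."""
    # Interior cut points from the evaluation-set sample (n_bins - 1 of them).
    cuts = sorted(set(statistics.quantiles(reference, n=n_bins)))

    def proportions(values):
        counts = [0] * (len(cuts) + 1)
        for v in values:
            # Values outside the reference range fall into the first or last bin.
            counts[bisect.bisect_right(cuts, v)] += 1
        # Clip empty bins so the log terms stay finite, then renormalize.
        clipped = [max(c / len(values), eps) for c in counts]
        norm = sum(clipped)
        return [c / norm for c in clipped]

    return proportions(reference), proportions(current)


def psi(reference, current, n_bins=10):
    """Population Stability Index: sum over bins of (q - p) * ln(q / p)."""
    p, q = binned_proportions(reference, current, n_bins)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def kl_divergence(reference, current, n_bins=10):
    """KL(current || reference) over the same bins."""
    p, q = binned_proportions(reference, current, n_bins)
    return sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))


def should_refresh_eval_set(reference, current, psi_threshold=0.2):
    """Hypothetical trigger: True when drift exceeds the (assumed) threshold."""
    return psi(reference, current) > psi_threshold
```

In practice `reference` would be the feature values of the evaluation set and `current` a rolling window of production inputs; a `True` result would queue recent production samples for human labeling. PSI is the symmetric sum of the two KL directions, so it is always at least as large as either one-sided KL value.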
Journey Context:
Traditional software has a stable relationship between test coverage and production reliability: if the tests pass and the code hasn't changed, behavior is the same. AI products break this contract because the input distribution shifts while the model stays static. Users discover new use cases, world events change the data landscape, and the model's competence silently degrades on these new distributions. A static evaluation set doesn't capture this because it was drawn from the historical distribution. The synthesis: combining the concept drift detection literature with production metrics analysis suggests that the gap between evaluation-set accuracy and production accuracy doesn't widen linearly; it collapses suddenly when the model encounters a subdistribution it was never evaluated on. A model that is 95% accurate on the evaluation set can drop to 70% on emerging production inputs without any code change, and the evaluation set will never flag this because it doesn't contain the new subdistribution.
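The blind spot can be made concrete with a small sketch, assuming a sample of production inputs has been human-labeled and tagged with a slice name. The slice names, the helper functions, and the 20% traffic share below are hypothetical; the 95% and 70% figures are the illustrative numbers from the paragraph above.

```python
from collections import defaultdict


def slice_accuracy(records):
    """records: list of (slice_name, is_correct) pairs from labeled production samples."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [correct, seen]
    for slice_name, is_correct in records:
        totals[slice_name][0] += int(is_correct)
        totals[slice_name][1] += 1
    return {name: correct / seen for name, (correct, seen) in totals.items()}


def uncovered_slices(records, eval_slices):
    """Production slices that have no counterpart in the evaluation set."""
    return sorted({name for name, _ in records} - set(eval_slices))


def blended_accuracy(known_acc, new_acc, new_share):
    """Overall production accuracy when new_share of traffic is a new subdistribution."""
    return (1 - new_share) * known_acc + new_share * new_acc


# Hypothetical labeled sample: one slice the evaluation set covers, one it does not.
records = (
    [("billing", True)] * 19 + [("billing", False)]
    + [("new_use_case", True)] * 7 + [("new_use_case", False)] * 3
)
per_slice = slice_accuracy(records)
missing = uncovered_slices(records, eval_slices=["billing"])
overall = blended_accuracy(known_acc=0.95, new_acc=0.70, new_share=0.2)
```

Here the evaluation set would keep reporting the `billing` accuracy while the uncovered slice pulls the blended production figure down, which is why per-slice reporting on freshly labeled production data is needed to see the drop at all.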
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:02:38.156991+00:00 - report_created - created