Report #43724
[synthesis] Why AI product quality degrades even when the model hasn't changed
Monitor input distribution shift in production using embedding clustering and statistical distance metrics (KL divergence, Wasserstein distance). When new prompt clusters emerge, automatically flag them for evaluation. Maintain a living evaluation set, refreshed weekly with production samples, rather than a static benchmark. Alert on distribution shift, not just on output errors.
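A minimal sketch of the monitoring step, assuming prompts have already been embedded and assigned cluster labels by some upstream job (not shown). The function names, the thresholds (`kl_threshold`, `new_cluster_share`), and the smoothing constant are illustrative assumptions, not values from this report; tune them against your own traffic.

```python
import math
from collections import Counter


def cluster_histogram(labels, clusters, smoothing=1e-6):
    """Turn cluster labels into a smoothed probability distribution over `clusters`."""
    counts = Counter(labels)
    total = len(labels) + smoothing * len(clusters)
    return [(counts.get(c, 0) + smoothing) / total for c in clusters]


def kl_divergence(p, q):
    """KL(p || q) in nats; both inputs must be strictly positive and sum to 1."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples
    (for example, prompt lengths). Equal sizes are an assumption of this sketch."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def shift_report(eval_labels, prod_labels, kl_threshold=0.1, new_cluster_share=0.05):
    """Compare production cluster usage against the eval set and flag emerging clusters."""
    clusters = sorted(set(eval_labels) | set(prod_labels))
    p_prod = cluster_histogram(prod_labels, clusters)
    p_eval = cluster_histogram(eval_labels, clusters)
    kl = kl_divergence(p_prod, p_eval)

    # A cluster is "emerging" if the eval set never covers it but production
    # traffic sends it a non-trivial share of prompts.
    eval_seen = set(eval_labels)
    prod_counts = Counter(prod_labels)
    emerging = [
        c for c in clusters
        if c not in eval_seen and prod_counts[c] / len(prod_labels) >= new_cluster_share
    ]
    return {
        "kl": kl,
        "emerging_clusters": emerging,
        "alert": kl > kl_threshold or bool(emerging),
    }
```

The alert here fires on the input distribution alone, before any output error is observed. KL divergence is computed as KL(production || eval), so mass in clusters the eval set barely covers is penalized heavily; the smoothing term only keeps the ratio finite.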
Journey Context:
Traditional software has a stable relationship between test and production because the same code runs in both environments. AI products degrade because the user input distribution drifts away from the evaluation distribution, even when the model is unchanged. Users discover edge cases, develop new prompting strategies, and use the product for purposes the eval set does not cover. The model is static; the world is not. The synthesis: distribution shift is a well-known ML concept, but its product implications are underappreciated. Some users are adversarial: they probe boundaries, attempt jailbreaks, and develop 'prompt dialects' that diverge from eval-time prompts. Static eval sets can become stale within weeks. The product appears to degrade even though nothing in the system changed; the environment moved.
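To make the "living evaluation set" idea concrete, here is a rough sketch of a weekly refresh. The record shape (`prompt`, `added`), the sample size, and the choice to retire entries after a fixed age are all assumptions made for illustration; the report does not prescribe a retention policy, and sampled prompts would still need expected outputs or human review before they are useful as eval cases.

```python
import random
from datetime import date, timedelta


def refresh_eval_set(eval_set, production_prompts, today,
                     per_week=50, max_age_weeks=12, seed=0):
    """Return a new eval set: retire stale entries, then add a random
    sample of production prompts that are not already covered."""
    cutoff = today - timedelta(weeks=max_age_weeks)
    kept = [e for e in eval_set if e["added"] >= cutoff]

    # Deduplicate candidates and exclude prompts already in the eval set.
    seen = {e["prompt"] for e in kept}
    candidates = sorted({p for p in production_prompts if p not in seen})

    # A seeded RNG keeps the weekly sample reproducible for auditing.
    rng = random.Random(seed)
    sampled = rng.sample(candidates, min(per_week, len(candidates)))
    return kept + [{"prompt": p, "added": today} for p in sampled]
```

Uniform random sampling is the simplest choice; sampling more heavily from newly emerged prompt clusters would target the drift more directly, at the cost of an eval set that no longer mirrors overall traffic.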
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:51:52.719756+00:00— report_created — created