Report #53454
[synthesis] Why A/B testing AI features shows wins that vanish at 100% rollout
Isolate A/B test interaction data from training pipelines before running experiments. Tag treatment vs control interaction data and either train separate models per cohort or exclude A/B test periods from training windows. Run holdout validation on clean post-rollout data before declaring victory.
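A minimal sketch of the two isolation options above, assuming interaction logs already carry a cohort tag and a timestamp. The `Interaction` record, its field names, and both helper functions are hypothetical illustrations, not part of any particular experimentation platform.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Interaction:
    """Hypothetical logged interaction; field names are assumptions."""
    user_id: str
    timestamp: datetime
    cohort: Optional[str]  # "treatment", "control", or None outside any experiment


# Half-open interval [start, end) during which an A/B test was live.
Window = Tuple[datetime, datetime]


def exclude_test_periods(rows: Iterable[Interaction],
                         test_windows: List[Window]) -> List[Interaction]:
    """Option 1: drop every interaction logged while any experiment was running."""
    def in_test(ts: datetime) -> bool:
        return any(start <= ts < end for start, end in test_windows)

    return [r for r in rows if not in_test(r.timestamp)]


def only_cohort(rows: Iterable[Interaction], cohort: str) -> List[Interaction]:
    """Option 2: keep a single cohort so a per-cohort model can be trained on it."""
    return [r for r in rows if r.cohort == cohort]
```

Either filter would run before the retraining job reads its window. Excluding test periods costs data volume, while per-cohort training costs a second model, so the choice depends on how long experiments run relative to the training window.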
Journey Context:
In traditional A/B testing, treatment and control are independent observations of a fixed system. In AI products that retrain on user interactions, the A/B test itself contaminates the training data: treatment users generate different interaction patterns than control users, and both feed into the same retraining pipeline. The model behind the winning variant was therefore trained on a mix in which treatment-style interactions appear at the treatment's traffic percentage (e.g., 50%), not at 100%. Rolling the winner out to 100% changes that mix, so the next retraining sees a different distribution and the model's behavior shifts.

The synthesis: combining Kohavi's framework for trustworthy controlled experiments with the ML data contamination literature reveals that AI A/B tests have a distinctive 'data contamination' failure mode. The test does not just measure the present; it permanently alters the future training distribution. This is why A/B test wins in AI products often vanish or reverse at full rollout: the measured lift belongs to a model trained on a data distribution that no longer exists once rollout changes the traffic split.
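The size of the distribution shift can be sketched with simple mixture arithmetic. The functions below are illustrative only and assume uniform daily traffic and an experiment that falls entirely inside the training window. The function names and any rates passed to them are hypothetical, not measured values.

```python
def treatment_share(window_days: int, test_days: int,
                    treatment_fraction: float) -> float:
    """Fraction of training rows generated under the treatment variant.

    Assumes uniform daily traffic and that the test period lies wholly
    inside the training window; days outside the test count as control-like.
    """
    if window_days <= 0 or not 0 <= test_days <= window_days:
        raise ValueError("test_days must be between 0 and window_days")
    if not 0.0 <= treatment_fraction <= 1.0:
        raise ValueError("treatment_fraction must be between 0 and 1")
    return (test_days * treatment_fraction) / window_days


def blended_rate(share: float, rate_treatment: float,
                 rate_control: float) -> float:
    """Aggregate interaction rate the training data exhibits for a given share."""
    return share * rate_treatment + (1.0 - share) * rate_control


# During a 14-day test at a 50% split inside a 28-day training window,
# only a quarter of the training rows come from treatment.
during_test = treatment_share(window_days=28, test_days=14, treatment_fraction=0.5)

# Once the winner has served all traffic for a full window, every row does.
after_rollout = treatment_share(window_days=28, test_days=28, treatment_fraction=1.0)
```

Under these assumptions the model that won the test learned from a 25% treatment mix, while the model retrained after rollout learns from a 100% treatment mix. Passing both shares to `blended_rate` with the per-cohort rates from an experiment gives a rough estimate of how far the training signal moves.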
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:13:02.150109+00:00 - report_created - created