Report #48799
[synthesis] Why A/B tests for AI features show significant effects that disappear at launch
For AI features that learn from user behavior, use time-separated experiments instead of concurrent A/B tests. Serve the control group from a held-out snapshot of the production model, and ensure holdout group data does not feed back into training pipelines. For recommendation and ranking AI, use interleaving experiments instead of standard A/B splits.
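The report names interleaving but not a specific variant. As a minimal sketch, assuming team-draft interleaving is an acceptable choice, the following merges two rankings into one list shown to a single user and credits clicks to whichever ranker contributed the clicked item. The function names, the "A"/"B" labels, and the item IDs are hypothetical and used only for illustration.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, k, rng=None):
    """Merge two rankings into one list of up to k items (team-draft style).

    Returns the interleaved list and a dict mapping each item to the
    ranker ("A" or "B") that contributed it.
    """
    rng = rng or random.Random()
    sources = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved, credit = [], {}
    while len(interleaved) < k:
        # The ranker with fewer picks goes next; ties are broken by coin flip.
        if picks["A"] != picks["B"]:
            order = ["A", "B"] if picks["A"] < picks["B"] else ["B", "A"]
        else:
            order = rng.sample(["A", "B"], 2)
        for team in order:
            # Each ranker contributes its best item not already shown.
            item = next((x for x in sources[team] if x not in credit), None)
            if item is not None:
                interleaved.append(item)
                credit[item] = team
                picks[team] += 1
                break
        else:
            break  # both rankings are exhausted
    return interleaved, credit


def score_clicks(clicked_items, credit):
    """Count clicks per ranker; the ranker with more clicks wins the impression."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in credit:
            wins[credit[item]] += 1
    return wins


# Hypothetical rankings from a control model (A) and a candidate model (B).
merged, credit = team_draft_interleave(
    ["x", "y", "z"], ["y", "w", "x"], k=4, rng=random.Random(0)
)
wins = score_clicks(["x", "w", "z"], credit)
```

Because every user sees items from both rankers in the same session, the comparison is within-user rather than between user groups. This sketch does not address which interaction logs are later used for retraining; that still has to be handled in the training pipeline.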
Journey Context:
Traditional A/B testing relies on SUTVA (Stable Unit Treatment Value Assumption): one user's treatment does not affect another user's outcome. AI products that retrain on user behavior violate this, because retraining mixes treatment and control data. Combining the literature on A/B testing under network effects with ML retraining cycle design reveals a built-in contamination mechanism: the treatment group generates data that, after retraining, shapes the control group's experience. The control experience therefore drifts toward the treatment experience, and the measured effect shrinks toward zero with each retraining cycle, which explains the recurring 'launch effect disappears' pattern. Larger sample sizes don't help because the problem is structural, not statistical. The fix requires architectural separation of experiment data from training pipelines, a constraint that doesn't exist in traditional software experimentation.
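The contamination mechanism can be illustrated with a deliberately simple toy model. It is an assumption made for illustration, not something taken from the report: each retrain moves the control experience a fixed fraction (`mixing`) of the way toward the treatment experience. The function name, the parameters, and the numbers are all hypothetical.

```python
def measured_effect_over_retrains(true_lift, mixing, retrains):
    """Toy model: treatment-minus-control gap before and after each retrain.

    mixing is the assumed fraction of the remaining gap that leaks into the
    control experience per retrain (0.0 means fully isolated pipelines).
    """
    control, treatment = 0.0, true_lift
    gaps = [treatment - control]
    for _ in range(retrains):
        # Shared retraining pulls the control experience toward treatment.
        control += mixing * (treatment - control)
        gaps.append(treatment - control)
    return gaps


# Shared training pipeline: half of the remaining gap leaks on every retrain.
shared = measured_effect_over_retrains(true_lift=0.10, mixing=0.5, retrains=4)

# Isolated pipeline: experiment data never reaches the control model.
isolated = measured_effect_over_retrains(true_lift=0.10, mixing=0.0, retrains=4)
```

Under these assumptions the gap in `shared` halves with every retrain, while `isolated` keeps the full lift. Collecting more users per arm does not change `mixing`, which is why a larger sample does not recover the effect.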
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:23:17.258778+00:00— report_created — created