Report #96867
[synthesis] A/B tests for AI features show contaminated results due to shared-model feedback loops
Isolate A/B experiments at the model level, not just the user level. Deploy separate model instances for control and treatment, prevent cross-group training data leakage, and add a holdout period before metric readout to account for model adaptation lag. Measure interaction effects between groups explicitly.
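The recommendation above can be sketched as three small pieces: a deterministic arm assignment, a per-arm training filter so each model instance only sees its own group's data, and a readout filter that skips the holdout window. This is a minimal sketch, not a reference implementation; `assign_arm`, `Event`, `training_events`, and `readout_events` are hypothetical names introduced here, and the 50/50 hash split is an assumption.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

ARMS = ("control", "treatment")


def assign_arm(user_id: str, experiment: str) -> str:
    """Deterministically map a user to an arm (assumed 50/50 split)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return ARMS[digest[0] % 2]


@dataclass(frozen=True)
class Event:
    user_id: str
    timestamp: datetime
    value: float


def training_events(events: list[Event], arm: str, experiment: str) -> list[Event]:
    """Keep only events from users in `arm`, so the model instance for
    that arm never trains on the other group's behavior."""
    return [e for e in events if assign_arm(e.user_id, experiment) == arm]


def readout_events(events: list[Event], start: datetime, holdout: timedelta) -> list[Event]:
    """Drop events logged during the holdout window after experiment start,
    so metrics are read only after the models have had time to adapt."""
    cutoff = start + holdout
    return [e for e in events if e.timestamp >= cutoff]
```

Each arm's model instance would be trained only on `training_events(..., arm, ...)`, and metrics computed only on `readout_events(...)`. The length of the holdout window is a judgment call that depends on how quickly the model retrains.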
Journey Context:
In traditional software A/B testing, randomizing users into control and treatment is sufficient because the software is static. In AI products, if both groups share a model that learns from user behavior, the treatment group's behavior leaks into the model and affects the control group, violating the Stable Unit Treatment Value Assumption (SUTVA). Kohavi's trustworthy experiments framework assumes independent units, and the federated learning literature discusses data isolation, but combining the two reveals a failure mode specific to AI features: an A/B test can show zero effect even when the feature matters, because the shared model equalizes behavior across groups. Teams then conclude the feature has no impact and kill it, when in fact the experiment was invalid. Model-level isolation is expensive but non-negotiable for valid AI experimentation.
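A toy simulation can make the equalization effect concrete. It is an illustrative sketch under strong assumptions, not a model of any real system: the feature's entire effect is assumed to flow through what the model learns, each model is a single number updated by an exponential moving average, and the `simulate` function with its parameters is hypothetical.

```python
def simulate(isolated: bool, steps: int = 200, lr: float = 0.1) -> float:
    """Return the measured treatment-minus-control outcome difference.

    Assumption: a user's outcome equals the state of the model serving
    them, and the model learns from the feedback of the users it serves.
    """
    # Feedback signal each arm generates; the treatment feature shifts it.
    feedback = {"control": 0.0, "treatment": 1.0}

    # Route each arm to its own model instance, or to one shared model.
    if isolated:
        model_for = {"control": "m_control", "treatment": "m_treatment"}
    else:
        model_for = {"control": "m_shared", "treatment": "m_shared"}

    theta = {model_id: 0.0 for model_id in set(model_for.values())}

    for _ in range(steps):
        for arm in ("control", "treatment"):
            model_id = model_for[arm]
            # Exponential-moving-average update toward the arm's feedback.
            theta[model_id] += lr * (feedback[arm] - theta[model_id])

    return theta[model_for["treatment"]] - theta[model_for["control"]]


shared_diff = simulate(isolated=False)
isolated_diff = simulate(isolated=True)
print(f"shared model:    measured difference = {shared_diff:.3f}")
print(f"isolated models: measured difference = {isolated_diff:.3f}")
```

With a shared model both arms are served by the same state, so the measured difference is zero by construction even though the treatment feedback differs. With isolated instances the difference approaches the true gap between the two feedback signals. Real systems usually sit between these extremes, with partial rather than total leakage.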
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:10:38.594154+00:00— report_created — created