Report #29921
[synthesis] A/B test results for AI features are unreliable or show effects that vanish at full rollout
Isolate training data pipelines between experiment groups, or use time-based splits instead of user-based splits for AI experiments. If treatment-group interactions feed a shared model, the control group is no longer truly controlled. Verify that SUTVA (Stable Unit Treatment Value Assumption) holds before trusting results.
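One way to picture pipeline isolation is the minimal Python sketch below. It is not taken from any specific experimentation platform: `assign_variant`, `IsolatedTrainingLog`, and the experiment name are hypothetical names chosen for illustration. It assumes users are bucketed by a stable hash and that each variant's model trains only from its own event buffer.

```python
import hashlib
from collections import defaultdict

VARIANTS = ("control", "treatment")


def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically map a user to a variant by hashing (experiment, user_id)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    return VARIANTS[digest[0] % len(VARIANTS)]


class IsolatedTrainingLog:
    """Hypothetical event store: one training buffer per variant, never mixed."""

    def __init__(self) -> None:
        self._events = defaultdict(list)

    def record(self, user_id: str, experiment: str, event: dict) -> str:
        # Route the event to the buffer of the variant that produced it.
        variant = assign_variant(user_id, experiment)
        self._events[variant].append(event)
        return variant

    def training_data(self, variant: str) -> list:
        # A variant's model may only read its own buffer.
        if variant not in VARIANTS:
            raise ValueError(f"unknown variant: {variant}")
        return list(self._events[variant])


if __name__ == "__main__":
    log = IsolatedTrainingLog()
    for i in range(10):
        log.record(f"user-{i}", "exp-1", {"user": f"user-{i}", "clicked": i % 2 == 0})
    for variant in VARIANTS:
        print(variant, len(log.training_data(variant)), "training events")
```

Each variant's model instance would then be trained only from `training_data(variant)`, so treatment-group signals never reach the model that serves the control group.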
Journey Context:
Traditional A/B testing assumes SUTVA: one user's treatment does not affect another user's outcome. For AI features backed by a shared model that keeps updating, this assumption breaks silently: treatment-group users generate training signals that update the shared model, which then also serves control-group users. The experiment 'leaks.' Teams see significant A/B results that vanish at 100% rollout because the shared model was pulled in both directions during the experiment, creating artificial differentiation between the groups. The fix is expensive (separate model instances per variant) but necessary for trustworthy results.
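The leak can be illustrated with a toy Python simulation. This is a sketch under strong assumptions, not a model of any real system: the "model" is a single weight updated by an exponential moving average, control users always emit the signal 0.0, and treatment users always emit 1.0. All names (`ema_update`, `run_shared`, `run_isolated`, `SIGNAL`) are hypothetical.

```python
# Assumed toy signals: what each group's interactions "teach" the model.
SIGNAL = {"control": 0.0, "treatment": 1.0}


def ema_update(weight: float, signal: float, lr: float = 0.1) -> float:
    """Move the weight a fraction `lr` of the way toward the new signal."""
    return weight + lr * (signal - weight)


def run_shared(events: list, lr: float = 0.1) -> float:
    """One model learns from both groups and serves both groups."""
    weight = 0.0
    for variant in events:
        weight = ema_update(weight, SIGNAL[variant], lr)
    return weight


def run_isolated(events: list, lr: float = 0.1) -> dict:
    """Each variant has its own model that only sees its own group's signals."""
    weights = {"control": 0.0, "treatment": 0.0}
    for variant in events:
        weights[variant] = ema_update(weights[variant], SIGNAL[variant], lr)
    return weights


# Interleave 50 control and 50 treatment interactions.
events = ["control", "treatment"] * 50
shared_weight = run_shared(events)
isolated = run_isolated(events)

if __name__ == "__main__":
    print(f"shared model served to everyone: {shared_weight:.3f}")
    print(f"isolated control model:          {isolated['control']:.3f}")
    print(f"isolated treatment model:        {isolated['treatment']:.3f}")
```

In this toy setup the isolated control model never moves from its baseline, while the shared model settles between the two groups' signals. Control users are therefore served a model that the treatment has already changed, which is the SUTVA violation described above.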
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:36:50.056936+00:00— report_created — created