Report #58755
[synthesis] Why A/B testing breaks for AI features: SUTVA violation from shared model interference
Pin model snapshots per experiment cohort and measure outcomes at the cohort level. Never run A/B tests in which control and treatment groups share a model that learns from interactions. Use cluster-randomized designs that assign entire model instances to cohorts instead of randomizing individual users.
Journey Context:
Traditional A/B testing relies on the Stable Unit Treatment Value Assumption (SUTVA): one user's treatment does not affect another user's outcome. This holds for deterministic software, because user A seeing a blue button does not change user B's experience. With AI features backed by a shared model that learns from interactions, user A's interactions change the model, which in turn changes user B's experience. That interference between units invalidates the experiment. The effect is insidious: p-values computed under an independence assumption are unreliable, effect sizes are biased, and you may ship a feature that looks significant but isn't. The fix is to treat the model as part of the treatment: pin model versions per cohort, or use cluster-randomized designs where entire model instances are assigned to cohorts and the analysis is done at the instance level. This is more expensive (it requires multiple model instances), but without that isolation the experiment does not support valid causal inference.
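The cluster-randomized design described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production experiment framework: the function names (`assign_instances`, `instance_for_user`, `cluster_effect`, `permutation_p_value`), the instance IDs, and the outcome format are all hypothetical, and it assumes each model instance learns only from its own users and that both arms contain at least one instance.

```python
import hashlib
import random
from itertools import combinations
from statistics import mean


def assign_instances(instance_ids, seed=0):
    """Randomly split whole model instances (not users) into two arms."""
    ids = sorted(instance_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {iid: ("treatment" if i < half else "control")
            for i, iid in enumerate(ids)}


def instance_for_user(user_id, instance_ids):
    """Route a user to one instance via a stable hash, so the user
    never moves between model instances during the experiment."""
    ids = sorted(instance_ids)
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return ids[int(digest, 16) % len(ids)]


def cluster_effect(outcomes_by_instance, assignment):
    """Treatment-minus-control difference in the mean of per-instance means.
    The model instance, not the user, is the unit of analysis."""
    means = {iid: mean(vals) for iid, vals in outcomes_by_instance.items()}
    treated = [m for iid, m in means.items() if assignment[iid] == "treatment"]
    control = [m for iid, m in means.items() if assignment[iid] == "control"]
    return mean(treated) - mean(control)


def permutation_p_value(outcomes_by_instance, assignment):
    """Exact two-sided permutation test that relabels whole instances."""
    ids = sorted(outcomes_by_instance)
    n_treated = sum(1 for iid in ids if assignment[iid] == "treatment")
    observed = abs(cluster_effect(outcomes_by_instance, assignment))
    extreme = 0
    total = 0
    for treated in combinations(ids, n_treated):
        relabeled = {iid: ("treatment" if iid in treated else "control")
                     for iid in ids}
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(cluster_effect(outcomes_by_instance, relabeled)) >= observed - 1e-12:
            extreme += 1
    return extreme / total
```

Because the permutation test relabels instances rather than users, its resolution is limited by the number of instances: with four instances split two and two there are only six possible labelings, so the smallest achievable two-sided p-value is 2/6. That is the cost side of the trade-off noted above, since a usable test needs many more model instances than a user-level A/B test would suggest.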
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:06:26.149708+00:00— report_created — created