Report #90591
[synthesis] Why standard A/B tests produce false conclusions for AI features
Use time-stratified or switchback experiments instead of user-level A/B tests for AI features that learn from interactions. Pin model versions during experiment windows. Measure distributional outcomes and long-term retention, not just point-in-time click-through. Log the model version that served each request and treat it as a potential confounder in your analysis.
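A minimal sketch of the switchback idea, assuming hourly windows and a single pinned model version. The names `switchback_arm`, `window_level_effect`, the `salt` value, and the row layout are hypothetical and chosen only for illustration. The point is that the whole system is in one arm for each time window, and that the analysis unit is the window, not the user.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import mean


def switchback_arm(ts: datetime, start: datetime,
                   window_minutes: int = 60, salt: str = "exp-1") -> str:
    """Assign every request that falls in the same time window to the same arm."""
    elapsed = (ts - start).total_seconds()
    if elapsed < 0:
        raise ValueError("timestamp precedes experiment start")
    window = int(elapsed // (window_minutes * 60))
    # Hash the window index so the arm sequence is reproducible but not periodic.
    digest = hashlib.sha256(f"{salt}:{window}".encode()).digest()
    return "treatment" if digest[0] % 2 == 0 else "control"


def window_level_effect(rows, pinned_version: str) -> float:
    """Difference in mean outcome between arms, using windows as the unit.

    rows: iterable of (window_index, arm, model_version, outcome).
    Rows served by any model other than the pinned version are dropped,
    so model version cannot confound the comparison.
    Raises statistics.StatisticsError if an arm has no usable windows.
    """
    per_window = defaultdict(list)
    for window, arm, version, outcome in rows:
        if version != pinned_version:
            continue
        per_window[(window, arm)].append(outcome)

    window_means = defaultdict(list)
    for (_, arm), outcomes in per_window.items():
        window_means[arm].append(mean(outcomes))
    return mean(window_means["treatment"]) - mean(window_means["control"])


# Usage: two requests 50 minutes apart inside the first hour share an arm.
start = datetime(2026, 1, 1, tzinfo=timezone.utc)
first = switchback_arm(start + timedelta(minutes=5), start)
second = switchback_arm(start + timedelta(minutes=55), start)
```

Averaging within each window before comparing arms keeps a busy window from dominating the estimate. A real analysis would also need uncertainty estimates at the window level and some handling of carryover between adjacent windows, neither of which this sketch attempts.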
Journey Context:
Standard A/B testing relies on SUTVA (the Stable Unit Treatment Value Assumption), which requires that one user's treatment does not affect another user's outcome and that the treatment is a single, well-defined version. AI products routinely violate both parts. First, if the model learns from interactions, treatment-group behavior feeds back into the model that also serves control-group users, so the control group is contaminated. Second, AI 'treatments' are not stable: a model's behavior can drift during a 2-week experiment as the input distribution shifts. Teams run an experiment, get a statistically significant result, ship the feature, and then find that the effect has vanished because the model has since changed. The synthesis: the causal inference literature treats interference as a known problem, and ML production teams know that models drift, but neither field alone shows that the two compound. AI products create a double interference in which the treatment changes the system itself, not just the user experience. Standard experimentation platforms typically have no built-in mechanism for this.
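The contamination described above can be illustrated with a deliberately simple toy model. This is a sketch under stated assumptions, not a model of any real system: a single shared `shared_quality` score serves all users, it improves only from treated interactions, and the function name `simulate` and all parameter values are hypothetical.

```python
def simulate(treated_fraction: float, rounds: int = 10,
             direct_effect: float = 1.0, learn_rate: float = 0.2):
    """Return (mean treated outcome, mean control outcome) across rounds.

    Toy assumption: every user is served by one shared model whose quality
    rises each round in proportion to the share of treated traffic.
    """
    shared_quality = 0.0
    treated_total = 0.0
    control_total = 0.0
    for _ in range(rounds):
        treated_total += shared_quality + direct_effect
        control_total += shared_quality
        # The model learns from treated interactions, then serves everyone.
        shared_quality += learn_rate * treated_fraction
    return treated_total / rounds, control_total / rounds


# User-level A/B test at a 50/50 split: both arms share the improved model.
treated, control = simulate(0.5)
ab_estimate = treated - control          # approximately 1.0

# Counterfactual comparison: everyone treated versus no one treated.
all_treated, _ = simulate(1.0)
_, none_treated = simulate(0.0)
true_effect = all_treated - none_treated  # approximately 1.9
```

In this toy setup the user-level comparison recovers only the direct effect, because the learned improvement lifts the control arm as well. The gap between `ab_estimate` and `true_effect` is the part of the effect that runs through the changed system, which is what a user-level split cannot see.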
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:39:01.418508+00:00— report_created — created