Report #29529
[synthesis] A/B test results for AI features are unreliable or show inflated variance that masks real effects
Use a within-user crossover design, in which each user experiences both variants in randomized order. If you keep a between-user design, compute cluster-robust standard errors clustered by user, so that repeated, correlated outputs from the same user are not counted as independent observations. Do not run AI feature A/B tests with the unadjusted between-subject design used for deterministic UI changes.
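As a rough illustration of the cluster-robust option, the sketch below compares a naive standard error with a user-clustered one for a difference in means. It assumes users are assigned to exactly one arm and that each user contributes several scored outputs. The function name `cluster_robust_diff`, the row layout, and the arm labels are hypothetical, and the variance formula is the basic sandwich estimator (CR0) with no finite-sample correction.

```python
import math
from collections import defaultdict


def cluster_robust_diff(rows):
    """rows: iterable of (user_id, arm, value), arm is "control" or "treatment".

    Returns (difference in means, naive SE, cluster-robust SE).
    Assumes each user appears in only one arm (hypothetical data layout).
    """
    by_arm = defaultdict(list)
    for user_id, arm, value in rows:
        by_arm[arm].append((user_id, value))

    means, naive_var, cluster_var = {}, {}, {}
    for arm, obs in by_arm.items():
        n = len(obs)
        mean = sum(v for _, v in obs) / n
        means[arm] = mean
        # Naive variance of the mean: treats every observation as independent.
        naive_var[arm] = sum((v - mean) ** 2 for _, v in obs) / (n - 1) / n
        # Cluster-robust (CR0): sum residuals within each user, then square.
        resid_sum = defaultdict(float)
        for user_id, v in obs:
            resid_sum[user_id] += v - mean
        cluster_var[arm] = sum(s ** 2 for s in resid_sum.values()) / n ** 2

    diff = means["treatment"] - means["control"]
    naive_se = math.sqrt(naive_var["treatment"] + naive_var["control"])
    robust_se = math.sqrt(cluster_var["treatment"] + cluster_var["control"])
    return diff, naive_se, robust_se


# Made-up data: two observations per user, strongly correlated within user.
example_rows = [
    ("u1", "control", 1.0), ("u1", "control", 1.0),
    ("u2", "control", 3.0), ("u2", "control", 3.0),
    ("u3", "treatment", 2.0), ("u3", "treatment", 2.0),
    ("u4", "treatment", 6.0), ("u4", "treatment", 6.0),
]
diff, naive_se, robust_se = cluster_robust_diff(example_rows)
print(f"diff={diff:.2f} naive_se={naive_se:.3f} robust_se={robust_se:.3f}")
```

With only a handful of users, the uncorrected estimator tends to understate uncertainty, so a production analysis would more likely rely on a statistics library's clustered covariance option than on this hand-rolled version.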
Journey Context:
Standard A/B testing assumes SUTVA (Stable Unit Treatment Value Assumption): one user's outcome does not depend on other users' assignments, and each variant is a single, well-defined treatment. AI features violate the second condition: the same user hitting the same endpoint twice can get very different outputs, so a "variant" is really a distribution of outputs. This inflates within-group variance, which reduces statistical power and can mask real effects; with small samples, the noisy estimates can also produce spurious ones. Teams waste weeks on inconclusive experiments. The fix is to change the experimental design: in a within-user crossover, each user serves as their own control, so between-user variance cancels out and the treatment effect is measured per user. This is well established in causal inference but rarely applied in ML product experimentation.
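A minimal sketch of the crossover analysis described above, assuming each user has already been scored once under variant A and once under variant B. The helper names (`assign_orders`, `paired_effect`, `unpaired_se`) and the sample scores are hypothetical. The sketch ignores period and carryover effects, which a real crossover analysis would need to check.

```python
import math
import random
import statistics


def assign_orders(user_ids, seed=0):
    """Randomly assign each user to see A then B ("AB") or B then A ("BA")."""
    rng = random.Random(seed)
    return {uid: rng.choice(("AB", "BA")) for uid in user_ids}


def paired_effect(scores):
    """scores: dict of user_id -> (metric under A, metric under B).

    Returns (mean of B - A, standard error, t statistic) from per-user
    differences, so stable between-user differences drop out.
    """
    diffs = [b - a for a, b in scores.values()]
    mean_diff = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    t_stat = mean_diff / se if se > 0 else math.nan
    return mean_diff, se, t_stat


def unpaired_se(scores):
    """Standard error if the same numbers were treated as two independent groups."""
    a_vals = [a for a, _ in scores.values()]
    b_vals = [b for _, b in scores.values()]
    n = len(scores)
    return math.sqrt(statistics.variance(a_vals) / n + statistics.variance(b_vals) / n)


# Made-up scores: users differ a lot at baseline, the uplift is small but steady.
example_scores = {
    "u1": (10.0, 11.0),
    "u2": (50.0, 52.0),
    "u3": (30.0, 31.0),
    "u4": (70.0, 72.0),
}
orders = assign_orders(example_scores)
mean_diff, se, t_stat = paired_effect(example_scores)
print(orders)
print(f"paired: diff={mean_diff:.2f} se={se:.3f} t={t_stat:.2f}")
print(f"unpaired se={unpaired_se(example_scores):.3f}")
```

In this toy data the paired standard error is far smaller than the unpaired one because the large baseline differences between users never enter the per-user differences.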
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:57:18.619116+00:00— report_created — created