Report #86463
[synthesis] A/B tests for AI features violate SUTVA through stochastic output variance and social leakage
Use interleaved ranking experiments or switchback designs instead of standard A/B tests for AI features. Randomize at the level of user cohorts that do not share outputs with one another, and report within-group variance alongside treatment effects. For generative features, run repeated-measures designs in which the same user sees both conditions, so that prompt-level variance cancels out of the comparison.
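To illustrate why a repeated-measures design helps, here is a minimal sketch comparing the standard error of the treatment effect when the same scores are analysed as two independent arms versus as per-user paired differences. The quality scores are hypothetical and the helper names (`unpaired_se`, `paired_se`) are made up for this example; it assumes each user was scored once under each condition, in the same order in both lists.

```python
from math import sqrt
from statistics import mean, stdev


def unpaired_se(control, treatment):
    """Standard error of the difference in means, treating the arms as independent."""
    return sqrt(stdev(control) ** 2 / len(control)
                + stdev(treatment) ** 2 / len(treatment))


def paired_se(control, treatment):
    """Standard error of the mean per-user difference (same user, both conditions)."""
    diffs = [t - c for c, t in zip(control, treatment)]
    return stdev(diffs) / sqrt(len(diffs))


# Hypothetical quality scores: one entry per user, same user order in both lists.
# Scores vary a lot between users (prompt-level variance), but the per-user
# difference between conditions is fairly stable.
control = [2.0, 5.0, 8.0, 3.0, 9.0, 6.0]
treatment = [2.6, 5.4, 8.7, 3.5, 9.4, 6.6]

effect = mean(t - c for c, t in zip(control, treatment))
print(f"estimated effect:      {effect:.3f}")
print(f"SE, independent arms:  {unpaired_se(control, treatment):.3f}")
print(f"SE, paired by user:    {paired_se(control, treatment):.3f}")
```

In this made-up data the between-user spread dominates the independent-arms standard error, while the paired analysis only sees the spread of the differences. A real repeated-measures experiment would also need to randomize or counterbalance the order in which each user sees the two conditions.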
Journey Context:
Standard A/B testing relies on the Stable Unit Treatment Value Assumption (SUTVA): one user's treatment does not affect another user's outcome, and every user in an arm receives the same version of the treatment. For conventional software features, this mostly holds. For AI features, it breaks in two ways at once. First, within-group variance is massive: two users in the same treatment arm get outputs of different quality depending on their prompts, so the arm is not really a single treatment and you may need 5-10x the sample size to detect an effect. Most teams don't realize that their underpowered test is producing false negatives. Second, treatment leaks: a user in treatment gets a great AI output and shares it with a control-group colleague, and now the control group's expectations shift. Together, the two violations mean your p-values are unreliable and your effect-size estimates are biased. Increasing sample size addresses the variance but does not fix leakage. The fix is to borrow experimental designs from information retrieval (interleaving) and causal inference (switchback), which are structurally more robust to these violations.
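As a rough illustration of how output variance drives sample size: for a two-arm comparison of means, the usual normal-approximation formula is n per arm ≈ 2 (z_{1-α/2} + z_{power})² σ² / δ², so the required n grows with the variance σ². The sketch below assumes equal-sized arms, a common and known standard deviation, and hypothetical numbers; `samples_per_arm` is an illustrative helper, not a library function.

```python
import math
from statistics import NormalDist


def samples_per_arm(sigma, delta, alpha=0.05, power=0.8):
    """Approximate users per arm needed to detect a mean difference `delta`
    with a two-sided test, given outcome standard deviation `sigma`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


# Hypothetical scenario: the same target effect (delta = 0.1), but the outcome
# metric for the AI feature is three times noisier than for a deterministic one.
base = samples_per_arm(sigma=1.0, delta=0.1)
noisy = samples_per_arm(sigma=3.0, delta=0.1)

print(f"per-arm n, sigma=1.0: {base}")
print(f"per-arm n, sigma=3.0: {noisy}")
print(f"inflation factor:     {noisy / base:.1f}x")
```

Under these assumptions, tripling the standard deviation multiplies the required sample by about nine, which is the mechanism behind the underpowered tests described above. This calculation says nothing about leakage: a biased estimate stays biased no matter how large n is.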
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:43:09.147760+00:00 - report_created - created