Report #40476
[synthesis] Why A/B tests on AI features show no significant results even with large sample sizes
Use cluster-robust standard errors at the session/conversation level, not the user level. Report distributional treatment effects (quantile treatment effects), not just average treatment effects. Run tests 2-3x longer than traditional features. Consider switchback experiments instead of between-subjects designs for high-variance AI outputs.
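A minimal sketch of the first two recommendations, assuming interaction-level outcomes are logged as (session_id, score) pairs. The function names and toy data are hypothetical and not part of this report. Collapsing each session to its mean before comparing arms is a simple stand-in for a full cluster-robust (sandwich) estimator, not a replacement for one, and the quantile comparison is only a descriptive difference at each cut point.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, quantiles, variance


def session_means(rows):
    """Collapse (session_id, outcome) rows to one mean per session."""
    by_session = defaultdict(list)
    for session_id, outcome in rows:
        by_session[session_id].append(outcome)
    return [mean(values) for values in by_session.values()]


def clustered_diff(control_rows, treatment_rows):
    """Difference in session-level means, with a standard error that treats
    sessions (not individual interactions) as the independent unit."""
    c = session_means(control_rows)
    t = session_means(treatment_rows)
    diff = mean(t) - mean(c)
    se = sqrt(variance(t) / len(t) + variance(c) / len(c))
    return diff, se


def quantile_effects(control, treatment, n=4):
    """Treatment-minus-control difference at each of the n-1 cut points."""
    qc = quantiles(control, n=n)
    qt = quantiles(treatment, n=n)
    return [b - a for a, b in zip(qc, qt)]


# Hypothetical toy data: three sessions per arm, some with repeated interactions.
control = [("c1", 1.0), ("c1", 1.0), ("c2", 3.0), ("c2", 3.0), ("c3", 2.0)]
treatment = [("t1", 2.0), ("t1", 2.0), ("t2", 4.0), ("t3", 3.0), ("t3", 3.0)]

diff, se = clustered_diff(control, treatment)
qte = quantile_effects(session_means(control), session_means(treatment))
```

On real data the same idea is usually expressed through a regression with clustered covariance, and the quantile differences would need their own uncertainty estimate (for example a bootstrap that resamples whole sessions).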
Journey Context:
A/B testing assumes i.i.d. observations and a stable treatment effect. AI features violate both: (1) outputs within a session are correlated because the model's behavior is locally consistent, so the raw interaction count overstates the effective sample size; (2) the treatment effect is itself a distribution, because the model's non-determinism means the same user receives a different 'treatment' on each interaction. Microsoft's experimentation guidelines address network interference, but the AI-specific problem is deeper: the variance of the treatment effect is inherently larger because the model itself is stochastic. Teams see p-values > 0.05 and conclude the feature has no effect, when the test is actually underpowered for the true variance. Adding more users doesn't fix this; the experimental design has to account for within-model variance. Combining the interference literature with AI non-determinism suggests that AI A/B tests need fundamentally different statistical frameworks, not just larger samples.
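To make point (1) concrete, here is a rough sketch using the standard Kish design effect for equal-sized clusters, n_eff = n / (1 + (m - 1) * ICC). The function name is hypothetical, and the session length and intraclass correlation below are made-up illustrative numbers, not measurements from any experiment.

```python
def effective_sample_size(n_obs, cluster_size, icc):
    """Kish design effect for equal-sized clusters:
    n_eff = n_obs / (1 + (cluster_size - 1) * icc)."""
    design_effect = 1 + (cluster_size - 1) * icc
    return n_obs / design_effect


# Illustrative assumption: 100,000 logged interactions, 10 per session,
# and an intraclass correlation of 0.5 between outputs in the same session.
n_eff = effective_sample_size(100_000, cluster_size=10, icc=0.5)

# With no within-session correlation the nominal count is recovered.
n_independent = effective_sample_size(100_000, cluster_size=10, icc=0.0)
```

Under these assumed numbers the design effect is 5.5, so the 100,000 interactions carry roughly the information of about 18,000 independent observations, which is the sense in which a "large" test can still be underpowered.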
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:24:41.920523+00:00— report_created — created