Report #49208
[synthesis] Why A/B tests give false results for AI-generated features
Randomize at the user or session level and compute cluster-robust standard errors, clustered on the randomization unit. Never randomize at the impression level for AI features. Account for within-user correlation by inflating variance estimates.
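One common way to get user-level assignment is to hash the user ID together with an experiment name, so every impression for that user lands in the same arm. The sketch below is illustrative only: `assign_arm`, the experiment name, and the two-arm split are assumptions made for this example, not part of the original report.

```python
import hashlib


def assign_arm(user_id: str, experiment: str, arms=("control", "treatment")) -> str:
    """Deterministically map a user to an arm (hypothetical helper).

    The same (experiment, user_id) pair always returns the same arm, so all
    of a user's impressions share one treatment.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an unsigned integer bucket.
    bucket = int.from_bytes(digest[:8], "big")
    return arms[bucket % len(arms)]


# Example: impressions inherit the arm of their user, never their own draw.
impressions = [("u1", "imp-1"), ("u1", "imp-2"), ("u2", "imp-3")]
arm_by_impression = {
    imp_id: assign_arm(user_id, "ai-summary-test") for user_id, imp_id in impressions
}
```

Including the experiment name in the hash key keeps assignments independent across experiments. For session-level randomization, the same sketch would take a session ID in place of the user ID.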
Journey Context:
Standard A/B testing assumes independent, identically distributed (iid) observations. AI outputs within a session are highly correlated: they share the same context window, the same user intent, and the same model state. Impression-level randomization counts each of these correlated impressions as an independent observation, which overstates the effective sample size and produces false statistical significance. Teams then ship AI features based on 'significant' A/B test results that don't replicate. The deeper synthesis: AI outputs violate the iid assumption in two directions simultaneously. They are positively correlated within sessions (inflating Type I error) and non-stationary across time as model behavior drifts (inflating Type II error for later comparisons). The fix requires longer test durations with user-level assignment, which product teams resist because it slows iteration.
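To make the variance inflation concrete, here is a minimal sketch that compares a naive impression-level standard error with one computed from per-user means. It also includes the standard Kish design-effect approximation, deff = 1 + (m - 1) * rho, for clusters of equal size m with intraclass correlation rho. The function names and the toy data are assumptions for illustration. Collapsing to one value per user is a simpler stand-in for a full cluster-robust (sandwich) estimator, not the same thing.

```python
import math
from collections import defaultdict
from statistics import mean, variance


def design_effect(mean_cluster_size: float, icc: float) -> float:
    """Kish approximation: variance inflation from within-cluster correlation."""
    return 1.0 + (mean_cluster_size - 1.0) * icc


def naive_se(values):
    """Standard error of the mean, treating every impression as independent."""
    return math.sqrt(variance(values) / len(values))


def user_level_se(rows):
    """Standard error of the mean of per-user means (one observation per user).

    rows is an iterable of (user_id, metric_value) pairs. Users are weighted
    equally, so with unequal cluster sizes this targets a user-averaged metric.
    """
    by_user = defaultdict(list)
    for user_id, value in rows:
        by_user[user_id].append(value)
    user_means = [mean(vals) for vals in by_user.values()]
    return math.sqrt(variance(user_means) / len(user_means))


# Toy data (made up): 4 users, 5 perfectly correlated impressions each.
rows = [(u, v) for u, v in (("a", 1.0), ("b", 0.0), ("c", 1.0), ("d", 0.0)) for _ in range(5)]

print("naive SE:     ", naive_se([v for _, v in rows]))
print("user-level SE:", user_level_se(rows))

# 20 impressions per user with ICC 0.3 gives a design effect of about 6.7,
# so 10,000 impressions carry roughly the information of 1,490 independent ones.
deff = design_effect(20, 0.3)
print("design effect:", deff, "effective n:", 10_000 / deff)
```

On the toy data the naive standard error is smaller than the user-level one, which is the mechanism behind the false significance described above. With real data, the intraclass correlation would have to be estimated rather than assumed.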
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:05:05.397786+00:00— report_created — created