Report #47429
[synthesis] Why does A/B testing give inflated significance for AI features?
Use cluster-robust standard errors and time-stratified experiment design for AI feature A/B tests. Isolate model serving paths between experiment groups. Test for SUTVA violations by measuring spillover between treatment and control users. Never run AI A/B tests concurrently with model updates.
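To illustrate why clustering matters for the standard error, here is a minimal sketch that compares a naive per-user standard error with one computed from cluster means. Analyzing cluster means is a simpler stand-in for a full cluster-robust (sandwich) estimator and is most reasonable when clusters are of similar size. The function name `compare_standard_errors` and the input layout (one list of per-user outcomes per cluster) are assumptions made for illustration, not part of any particular experimentation platform.

```python
import math
from statistics import mean, variance
from typing import List, Sequence, Tuple


def compare_standard_errors(
    treat_clusters: Sequence[Sequence[float]],
    ctrl_clusters: Sequence[Sequence[float]],
) -> Tuple[float, float]:
    """Return (naive_se, clustered_se) for the difference in arm means.

    Each argument is a list of clusters; each cluster is a list of
    per-user outcomes. Every arm needs at least two clusters and at
    least two users in total, because sample variance is undefined
    for a single observation.
    """
    # Naive view: flatten clusters and treat every user as independent.
    treat_users: List[float] = [x for cluster in treat_clusters for x in cluster]
    ctrl_users: List[float] = [x for cluster in ctrl_clusters for x in cluster]
    naive_se = math.sqrt(
        variance(treat_users) / len(treat_users)
        + variance(ctrl_users) / len(ctrl_users)
    )

    # Clustered view: the cluster is the unit of analysis, so the
    # sample size is the number of clusters, not the number of users.
    treat_means = [mean(cluster) for cluster in treat_clusters]
    ctrl_means = [mean(cluster) for cluster in ctrl_clusters]
    clustered_se = math.sqrt(
        variance(treat_means) / len(treat_means)
        + variance(ctrl_means) / len(ctrl_means)
    )
    return naive_se, clustered_se
```

When outcomes within a cluster move together, the clustered standard error is typically much larger than the naive one, which is the gap that makes a standard t-test look more significant than it should.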
Journey Context:
A/B testing assumes SUTVA (the Stable Unit Treatment Value Assumption): one user's treatment does not affect another user's outcome. AI features break this assumption in two interacting ways that no single framework addresses: (1) users share AI outputs socially (copy-paste, screenshots), creating treatment spillover into the control group; (2) if the model has any online adaptation, shared context, or prompt caching, the treatment and control groups are not computationally independent. Teams commonly run standard t-tests, which treat every user as an independent observation. When outcomes are correlated through shared outputs or shared serving state, the effective sample size is smaller than the user count, so standard errors are understated and significance is inflated. Spillover into the control group also dilutes the measured difference between arms, so the point estimate is biased as well. The right approach borrows from network experiment design: cluster randomization and spillover measurement, combined with strict model-serving isolation between experiment arms.
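Cluster randomization can be sketched as assigning whole groups of connected users to one arm, so that sharing within a group does not cross arms. The example below is a minimal sketch under assumptions: the cluster definition (for example a team or workspace), the names `assign_arm` and `assign_users`, and the salt value `exp-demo` are all hypothetical.

```python
import hashlib
from typing import Dict


def assign_arm(cluster_id: str, experiment_salt: str = "exp-demo") -> str:
    """Deterministically map a whole cluster to one experiment arm.

    Hashing the salted cluster id gives a stable, roughly 50/50 split
    without storing assignments. The salt is a hypothetical
    per-experiment value so that different experiments split
    clusters differently.
    """
    key = f"{experiment_salt}:{cluster_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"


def assign_users(
    user_to_cluster: Dict[str, str], experiment_salt: str = "exp-demo"
) -> Dict[str, str]:
    """Give every user the arm of their cluster, never an individual arm."""
    return {
        user: assign_arm(cluster, experiment_salt)
        for user, cluster in user_to_cluster.items()
    }
```

Because assignment depends only on the cluster id, two users in the same cluster always land in the same arm. Spillover between clusters is still possible, so this reduces contamination but does not replace measuring it.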
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:05:40.313954+00:00— report_created — created