Report #56527
[synthesis] Why A/B tests give misleading results for AI-powered features
Use interleaving experiments or time-stratified switchback designs instead of simple user-level A/B splits. Measure not just the treatment effect at time T but the trajectory of the effect over time. Explicitly check for SUTVA violations by measuring whether control-group behavior changes after the experiment starts.
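A time-stratified switchback design can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `switchback_schedule` and `arm_at`, the 6-hour window length, and the rule of balancing arms within each calendar day are all hypothetical choices, not something this report prescribes. The idea is that the whole population switches arms by time window, so treatment and control never coexist at the same moment, and each day contributes equally to both arms.

```python
import random
from datetime import datetime, timedelta


def switchback_schedule(start, days, window_hours=6, seed=0):
    """Build a list of (window_start, window_end, arm) tuples.

    Each day is a stratum: it gets the same number of treatment and
    control windows, and their order is shuffled independently per day.
    """
    if 24 % window_hours != 0:
        raise ValueError("window_hours must divide 24")
    windows_per_day = 24 // window_hours
    if windows_per_day % 2 != 0:
        raise ValueError("need an even number of windows per day to balance arms")

    rng = random.Random(seed)  # seeded so the schedule is reproducible
    schedule = []
    for day in range(days):
        arms = ["treatment", "control"] * (windows_per_day // 2)
        rng.shuffle(arms)
        for i, arm in enumerate(arms):
            window_start = start + timedelta(days=day, hours=i * window_hours)
            window_end = window_start + timedelta(hours=window_hours)
            schedule.append((window_start, window_end, arm))
    return schedule


def arm_at(schedule, ts):
    """Return the arm active at timestamp ts, or None if ts is outside the schedule."""
    for window_start, window_end, arm in schedule:
        if window_start <= ts < window_end:
            return arm
    return None
```

Every request is then served according to `arm_at(schedule, request_time)` rather than a per-user hash. In practice the window length would need to be long enough for the effect to show up and short enough to get many switches. Carryover between adjacent windows is a separate concern that this sketch does not handle.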
Journey Context:
Standard A/B testing relies on the Stable Unit Treatment Value Assumption (SUTVA): one user's treatment doesn't affect another's outcome. AI features violate this in two ways that no single source connects: (1) the model adapts to treatment-group behavior, creating a moving target that makes the treatment effect non-stationary, and (2) AI outputs are shared, copied, and referenced across group boundaries, creating spillover. The effect measured on day 3 differs from the effect measured on day 14 because the AI itself has changed. Teams run standard A/B tests, see a significant effect, and ship, only to find the effect disappears in production because the model continued to evolve. The right call is an experiment design that accounts for temporal dynamics and spillover.
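The two checks described above, the effect trajectory and control-group drift, can be sketched with the standard library alone. The record layout `(day, arm, metric)` and the function names `daily_effects` and `control_drift` are assumptions made for illustration. A real analysis would add confidence intervals and a proper pre-period baseline rather than comparing raw means.

```python
from statistics import mean


def daily_effects(records):
    """Return {day: treatment_mean - control_mean} for each day with both arms present.

    records is an iterable of (day, arm, metric) tuples, where arm is
    "treatment" or "control".
    """
    by_day = {}
    for day, arm, value in records:
        by_day.setdefault(day, {"treatment": [], "control": []})[arm].append(value)
    return {
        day: mean(groups["treatment"]) - mean(groups["control"])
        for day, groups in sorted(by_day.items())
        if groups["treatment"] and groups["control"]
    }


def control_drift(pre_values, post_values):
    """Change in the control group's mean metric after the experiment started.

    A value far from zero suggests control users are being affected by the
    treatment (spillover), i.e. a possible SUTVA violation.
    """
    return mean(post_values) - mean(pre_values)
```

A usage sketch with made-up numbers (not measurements from any real experiment):

```python
records = [
    (3, "treatment", 1.0), (3, "control", 0.5),
    (14, "treatment", 0.75), (14, "control", 0.75),
]
print(daily_effects(records))                       # effect per day, not one pooled number
print(control_drift([0.5, 0.5], [0.75, 0.75]))      # nonzero drift flags possible spillover
```

Reporting the per-day series instead of a single pooled estimate makes a decaying or shifting effect visible before the ship decision.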
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:22:22.749940+00:00— report_created — created