Report #90164
[synthesis] Why A/B tests give misleading results for AI-powered features
Use time-stratified experiment analysis with multiple checkpoints instead of a single fixed-horizon readout. Measure the trajectory of the treatment effect over time, not just its value at the endpoint. Separate users who interact with the AI feature repeatedly from one-time users and analyze the two groups independently. Consider switchback experiments for features where the model adapts quickly.
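A minimal sketch of the checkpoint analysis described above, in Python with only the standard library. The row layout `(user_id, day, arm, metric)`, the function names, and the rule that a "repeat" user is one with at least two logged rows are all assumptions made for illustration, not part of the report. It returns point estimates only; a real analysis would also need confidence intervals at each checkpoint.

```python
from collections import Counter, defaultdict
from statistics import mean


def window_index(day, checkpoints):
    """Return the index of the first checkpoint with day <= checkpoint, or None."""
    for i, cp in enumerate(checkpoints):
        if day <= cp:
            return i
    return None


def effect_trajectory(rows, checkpoints):
    """Difference in mean metric (treatment minus control) per time window.

    rows: iterable of (user_id, day, arm, metric), arm in {"treatment", "control"}.
    checkpoints: ascending day boundaries, e.g. [14, 28]; window i covers days
    after checkpoint i-1 up to and including checkpoint i.
    Windows that lack data for either arm are reported as None.
    """
    buckets = defaultdict(lambda: {"treatment": [], "control": []})
    for _user, day, arm, metric in rows:
        i = window_index(day, checkpoints)
        if i is not None:
            buckets[i][arm].append(metric)
    trajectory = []
    for i, cp in enumerate(checkpoints):
        t, c = buckets[i]["treatment"], buckets[i]["control"]
        trajectory.append((cp, mean(t) - mean(c) if t and c else None))
    return trajectory


def split_by_repeat_use(rows, min_events=2):
    """Split rows into (repeat, one_time) by how many rows each user has.

    Assumption: each row is one logged interaction, so a user with at least
    min_events rows counts as a repeat user.
    """
    rows = list(rows)
    counts = Counter(row[0] for row in rows)
    repeat = [r for r in rows if counts[r[0]] >= min_events]
    one_time = [r for r in rows if counts[r[0]] < min_events]
    return repeat, one_time


# Made-up toy data: the effect is positive in days 1-14 and negative in days 15-28.
rows = [
    ("u1", 3, "treatment", 1.0),
    ("u2", 5, "control", 0.5),
    ("u1", 20, "treatment", 0.25),
    ("u3", 22, "control", 0.5),
]
trajectory = effect_trajectory(rows, [14, 28])
repeat_rows, one_time_rows = split_by_repeat_use(rows)
```

Running `effect_trajectory` separately on `repeat_rows` and `one_time_rows` gives one trajectory per segment. Comparing the two shows whether a drift over time is concentrated in the users the model has had the most chance to adapt to.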
Journey Context:
Standard A/B testing relies on the Stable Unit Treatment Value Assumption (SUTVA): units do not interfere with one another, and every unit receives the same well-defined version of the treatment. A fixed-horizon readout additionally treats the effect as stable over the test window. AI features violate these assumptions. The model adapts to treatment-group behavior (RLHF loops, personalization), so the treatment effect is non-stationary: it changes as the experiment runs. Kohavi et al. document how interference invalidates experiment results, but the AI-specific mechanism is deeper: the treatment effect is endogenous to the experiment itself. A model that learns from treatment-group users becomes a different model than one learning from control-group users, creating divergent trajectories rather than a fixed effect. Teams running 2-week A/B tests on AI features often see effects that reverse, amplify, or decay by week 4, so ship decisions made at week 2 rest on transient dynamics. The synthesis: AI features don't have a single 'treatment effect'; they have a treatment trajectory, and traditional A/B testing measures a snapshot of a moving target.
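One way to write the trajectory idea down, using notation that is an assumption for illustration and does not come from the report: let $Y_t(1)$ and $Y_t(0)$ be a user's potential outcomes on day $t$ under treatment and control, and let $T$ be the length of the test in days.

```latex
\tau(t) = \mathbb{E}\left[ Y_t(1) - Y_t(0) \right],
\qquad
\hat{\tau}_{\text{fixed}} \approx \frac{1}{T} \sum_{t=1}^{T} \hat{\tau}(t)
```

Assuming similar traffic each day, a pooled fixed-horizon estimate is roughly the average of the daily effects over the first $T$ days. It equals the long-run effect only if $\tau(t)$ is constant, and it says nothing about $\tau(t)$ for $t > T$, which is where a reversal or decay by week 4 would show up.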
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T09:56:15.385322+00:00— report_created — created