Report #80394
[synthesis] Why do A/B tests give conflicting or reversing results for AI features?
Use shorter test windows (days, not weeks), within-subject crossover designs, or multi-armed bandits instead of classical fixed-horizon A/B tests for AI features. Always measure treatment effect heterogeneity over time: plot day-by-day effect sizes, not cumulative averages.
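A minimal sketch of the day-by-day versus cumulative comparison recommended above. The function name `daily_and_cumulative_effects` and the toy numbers are hypothetical and chosen only for illustration; the sketch assumes one list of per-user metric values per day for each arm.

```python
from statistics import mean


def daily_and_cumulative_effects(control, treatment):
    """Return (daily, cumulative) effect sizes as lists of mean differences.

    control / treatment: one inner list of metric values per day (assumed layout).
    """
    daily, cumulative = [], []
    c_sum = c_n = t_sum = t_n = 0
    for c_day, t_day in zip(control, treatment):
        # Effect measured on this day only.
        daily.append(mean(t_day) - mean(c_day))
        # Effect pooled over every day seen so far.
        c_sum += sum(c_day)
        c_n += len(c_day)
        t_sum += sum(t_day)
        t_n += len(t_day)
        cumulative.append(t_sum / t_n - c_sum / c_n)
    return daily, cumulative


# Hypothetical toy data: control is flat, treatment decays over four days.
control = [[0.50, 0.50]] * 4
treatment = [[0.70, 0.70], [0.60, 0.60], [0.45, 0.45], [0.40, 0.40]]

daily, cumulative = daily_and_cumulative_effects(control, treatment)
for day, (d, c) in enumerate(zip(daily, cumulative), start=1):
    print(f"day {day}: daily={d:+.3f} cumulative={c:+.3f}")
```

In this toy data the daily effect turns negative on day 3 while the cumulative average is still positive on day 4, which is the pattern a cumulative-only readout would hide.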
Journey Context:
Classical fixed-horizon A/B testing rests on SUTVA (the Stable Unit Treatment Value Assumption: a single well-defined version of the treatment and no interference between units) and, by reporting one average, implicitly treats the treatment effect as stable across the test period. AI features strain these assumptions in two interacting ways: (1) users adapt their behavior to the AI over time (they learn to prompt differently, discover edge cases, change their task allocation), so the treatment on day 14 is materially different from the treatment on day 1; (2) the AI's output distribution shifts as it encounters the treatment group's adapted inputs, creating a feedback loop. A 14-day A/B test therefore does not measure a single treatment effect. It measures a time-varying effect that may cross zero, which makes cumulative metrics misleading. The synthesis: the interaction between user adaptation and model non-stationarity creates a confound that does not arise for deterministic features. No single source on A/B testing or ML monitoring identifies this dual interaction. A test that shows the AI feature winning in week 1 may show it losing by week 3 as users discover failure modes the AI can't handle. The fix is either to use bandit approaches that adapt to time-varying effects or to explicitly model the treatment effect as a function of exposure time.
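One way a bandit can adapt to a time-varying effect is to discount old observations so that recent outcomes dominate the posterior. The sketch below is an illustration under assumptions, not a method named in this report: the class `DiscountedThompsonSampler`, the `simulate` helper, the discount of 0.99, and the reward probabilities are all hypothetical, and rewards are assumed to be binary (0 or 1).

```python
import random


class DiscountedThompsonSampler:
    """Thompson sampling for Bernoulli rewards with exponential forgetting."""

    def __init__(self, n_arms, discount=0.99, seed=None):
        self.discount = discount
        self.successes = [0.0] * n_arms
        self.failures = [0.0] * n_arms
        self.rng = random.Random(seed)

    def select_arm(self):
        # Draw one sample per arm from Beta(1 + successes, 1 + failures).
        samples = [
            self.rng.betavariate(1.0 + s, 1.0 + f)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # Decay all evidence first, so stale observations fade over time.
        self.successes = [s * self.discount for s in self.successes]
        self.failures = [f * self.discount for f in self.failures]
        self.successes[arm] += reward
        self.failures[arm] += 1 - reward


def simulate(seed=0, steps_per_phase=400):
    """Arm 0 = control, arm 1 = AI feature whose success rate collapses."""
    rng = random.Random(seed)
    sampler = DiscountedThompsonSampler(n_arms=2, discount=0.99, seed=seed + 1)
    # Hypothetical success probabilities (control, AI feature) per phase.
    phases = [(0.5, 0.9), (0.5, 0.1)]
    pulls = []
    for probs in phases:
        for _ in range(steps_per_phase):
            arm = sampler.select_arm()
            reward = 1 if rng.random() < probs[arm] else 0
            sampler.update(arm, reward)
            pulls.append(arm)
    return pulls


pulls = simulate(seed=0)
early, late = pulls[200:400], pulls[600:800]
print("AI-feature share, end of phase 1:", sum(early) / len(early))
print("AI-feature share, end of phase 2:", sum(late) / len(late))
```

The discount trades responsiveness for noise: a value closer to 1 remembers more history and reacts more slowly, while a smaller value reacts faster but keeps re-exploring arms whose evidence has decayed.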
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:32:49.222636+00:00— report_created — created