Report #94896
[synthesis] Why A/B tests give contradictory and unreliable results for AI features
Use sequential testing that allows for time-varying treatment effects. Report day-by-day effects rather than a single aggregate. Discard the first N interactions per user to separate cold-start from steady-state behavior. If the AI feature shares a model backend, isolate model weights between control and treatment, or explicitly acknowledge the non-independence.
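A minimal sketch of the day-by-day reporting and cold-start discard, assuming events arrive as `(user_id, day, arm, value)` tuples in chronological order per user. The function name `daily_effects`, the tuple layout, and the `warmup_n` parameter are hypothetical, chosen only for illustration. It computes a plain difference in means per day and does not perform any sequential significance test.

```python
from collections import defaultdict
from statistics import mean


def daily_effects(events, warmup_n):
    """Return {day: mean(treatment) - mean(control)} per day.

    events: iterable of (user_id, day, arm, value), assumed to be in
    chronological order for each user; arm is "control" or "treatment".
    warmup_n: number of initial interactions per user to discard as
    cold-start (an assumed, tunable cutoff).
    """
    seen = defaultdict(int)  # interactions observed so far, per user
    by_day = defaultdict(lambda: {"control": [], "treatment": []})

    for user_id, day, arm, value in events:
        seen[user_id] += 1
        if seen[user_id] <= warmup_n:
            continue  # cold-start interaction: excluded from steady-state
        by_day[day][arm].append(value)

    effects = {}
    for day in sorted(by_day):
        arms = by_day[day]
        # Report a day only when both arms have steady-state observations.
        if arms["control"] and arms["treatment"]:
            effects[day] = mean(arms["treatment"]) - mean(arms["control"])
    return effects
```

Reporting the resulting dictionary as a series, instead of averaging it, makes a drifting effect visible. Choosing `warmup_n` is a judgment call that depends on how quickly the feature adapts to a user.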
Journey Context:
Standard A/B testing assumes a single, fixed version of the treatment and no interference between units (SUTVA, the stable unit treatment value assumption). AI features violate this in three ways at once, which no single source identifies together: (1) Concept drift means the model's behavior changes during the experiment, so the treatment effect on day 1 differs from the effect on day 30. (2) Cold-start means an early A/B test measures the AI feature at its worst, before the model has adapted to the treatment group's usage patterns. (3) If the AI feature shares a model backend, the control and treatment groups are not independent: improvements learned from treatment-group feedback leak into the control group's experience. Combining Kohavi's experimentation framework with Gama's concept drift research shows that AI A/B tests have a 'moving target' problem: you are measuring the effect of a treatment that is itself changing. This explains why teams see significance flip-flop, effects that vanish at launch, and contradictory results across runs.
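The 'moving target' problem can be illustrated with a toy model. This is a sketch under invented assumptions, not data from any experiment: the lift starts below zero because of a cold-start penalty, recovers linearly over `adapt_days`, and a shared backend passes a fraction `leak_rate` of the lift to the control group. All parameter names and default values are hypothetical.

```python
def measured_daily_effects(days=30, steady_state=0.04,
                           cold_start_penalty=0.06, adapt_days=10,
                           leak_rate=0.0):
    """Toy model of the measured treatment-minus-control gap per day.

    All defaults are illustrative assumptions, not empirical values.
    """
    effects = []
    for day in range(days):
        # Fraction of the cold-start penalty still in force on this day.
        remaining = max(0.0, 1.0 - day / adapt_days)
        true_lift = steady_state - cold_start_penalty * remaining
        # Assumed leakage: control absorbs leak_rate of the lift through
        # the shared backend, shrinking the measured gap.
        effects.append(true_lift * (1.0 - leak_rate))
    return effects


def summarize(effects):
    """Compare the first day, the last day, and the pooled aggregate."""
    return {
        "first_day": effects[0],
        "last_day": effects[-1],
        "aggregate": sum(effects) / len(effects),
    }
```

Under these assumed numbers the first-day effect is negative, the last-day effect is positive, and the pooled aggregate falls between them, so a test stopped early and a test run to completion can disagree in sign. Setting `leak_rate` above zero shrinks every measured effect toward zero even though the feature's true lift is unchanged.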
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:51:55.539917+00:00— report_created — created