Report #59519
[synthesis] Why A/B tests give false signals for AI features and lead to wrong product decisions
Use time-varying treatment effect models instead of fixed-effect A/B tests. Segment the analysis by user tenure (new vs. returning), because learning effects differ between the two groups. Set longer minimum test durations to account for AI discovery effects. Instrument for 'first AI encounter': track when each user first discovers each AI capability, not just when they were assigned to treatment.
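A minimal sketch of two of these recommendations, using only the Python standard library. The function names, the row keys (`arm`, `tenure`, `week`, `metric`), and the day-number encoding are hypothetical choices for illustration, not part of any existing framework. It records the first encounter per user and capability, and reports a treatment-minus-control difference per tenure segment and week instead of one pooled number. A real analysis would also need confidence intervals per cell.

```python
from collections import defaultdict
from statistics import mean


def record_first_encounter(first_seen, user_id, capability, day):
    """Keep the earliest day on which a user touched a given AI capability."""
    key = (user_id, capability)
    if key not in first_seen or day < first_seen[key]:
        first_seen[key] = day
    return first_seen[key]


def effect_by_tenure_and_week(rows):
    """Difference in mean metric (treatment minus control) per (tenure, week) cell.

    Cells missing either arm are skipped rather than guessed.
    """
    cells = defaultdict(lambda: {"treatment": [], "control": []})
    for row in rows:
        cells[(row["tenure"], row["week"])][row["arm"]].append(row["metric"])
    return {
        cell: mean(arms["treatment"]) - mean(arms["control"])
        for cell, arms in sorted(cells.items())
        if arms["treatment"] and arms["control"]
    }
```

If the per-week differences within a segment trend up or down rather than hovering around one value, a single fixed-effect estimate is hiding a time-varying effect.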
Journey Context:
Traditional A/B tests assume (1) independent observations, (2) stable treatment effects, and (3) a stable control baseline. AI features violate all three simultaneously. Outputs are stochastic (inflating variance and requiring larger samples), the model itself drifts or is updated mid-experiment (so the treatment effect is non-stationary), and the control group's baseline shifts too because underlying models or data change. The synthesis: a 2-week A/B test on an AI feature doesn't measure 'the effect of this feature'; it measures 'the effect during this specific window with this model version and this user learning stage.' Netflix's experimentation framework addresses interference; Google's CausalImpact handles non-stationarity. But the compound problem (non-deterministic treatment AND non-stationary baseline AND user learning effects) requires all three corrections simultaneously, which no standard framework provides.
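To make the "stochastic outputs require larger samples" point concrete, here is a rough sketch based on the textbook two-sample sample-size formula for a difference in means. It assumes the output noise is independent of between-user variation, so the two variances add; that assumption, the function name, and the parameters are illustrative and not taken from any specific experimentation platform.

```python
import math
from statistics import NormalDist


def n_per_arm(sigma_user, sigma_output, delta, alpha=0.05, power=0.8):
    """Users needed per arm to detect a mean difference `delta` (two-sided test).

    sigma_user:   std dev of the metric across users
    sigma_output: extra std dev from non-deterministic model output
                  (assumed independent of sigma_user, so variances add)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = sigma_user ** 2 + sigma_output ** 2
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * variance / delta ** 2)
```

Under these assumptions, output noise as large as the between-user spread (`sigma_output == sigma_user`) doubles the variance and roughly doubles the required sample. The formula does not account for model drift or a shifting baseline, which need separate handling.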
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:23:31.736612+00:00 - report_created - created