Report #55437
[synthesis] Why A/B testing fails for AI features and shows false positives
Use time-separated holdouts or shadow mode evaluation instead of concurrent A/B testing for model-driven features, and measure convergence over time rather than point-in-time lift.
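The shadow-mode option above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `shadow_serve`, `mean_abs_errors`, and the toy models are hypothetical names. It assumes the ground-truth outcome can be observed regardless of which response was served, which holds for prediction-style features but not for features where the served response changes user behavior.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[float], float]
ShadowLog = List[Tuple[float, float, float]]  # (request, served, shadow)


def shadow_serve(request: float, production: Model, candidate: Model,
                 log: ShadowLog) -> float:
    """Serve the production answer; record the candidate's answer on the side."""
    served = production(request)
    try:
        log.append((request, served, candidate(request)))
    except Exception:
        pass  # a failing shadow model must never affect the user-facing response
    return served


def mean_abs_errors(log: ShadowLog,
                    outcomes: Dict[float, float]) -> Tuple[float, float]:
    """Return (production MAE, candidate MAE) over requests with a known outcome.

    Outcomes are keyed by the request value for brevity (an assumption);
    a real log would key them by a request ID.
    """
    scored = [(abs(s - outcomes[r]), abs(c - outcomes[r]))
              for r, s, c in log if r in outcomes]
    if not scored:
        raise ValueError("no logged requests have an observed outcome")
    n = len(scored)
    return sum(p for p, _ in scored) / n, sum(c for _, c in scored) / n


# Toy models, for illustration only.
production = lambda x: 2 * x
candidate = lambda x: 2 * x + 1
outcomes = {1.0: 3.0, 2.0: 5.0, 3.0: 7.0}

log: ShadowLog = []
responses = [shadow_serve(r, production, candidate, log) for r in outcomes]
print(responses)                       # users only ever see production output
print(mean_abs_errors(log, outcomes))  # offline comparison of both models
```

Because the candidate never serves traffic, its predictions cannot feed back into user behavior during the evaluation window, which is the property the recommendation relies on.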
Journey Context:
Traditional A/B testing relies on the stable unit treatment value assumption (SUTVA): the outcome for one user must not depend on the treatment assigned to other users. In AI products, the treatment (the model) learns from all user interactions, including those of the control group if both arms share a training pipeline. Concurrent tests therefore suffer from data contamination: control data influences the treatment model via online learning or periodic retraining, which shrinks the measured delta. Furthermore, AI features exhibit novelty bias and cold-start weakness. Novelty can inflate early engagement, while a point-in-time measurement taken during the weak onboarding phase misses the long-term data flywheel. Time-separated holdouts isolate the model's steady-state behavior without violating independence.
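To make "measure convergence over time" concrete, here is a hedged sketch that computes relative lift per time window and checks whether the most recent windows have stabilized. The function names, the 7-day window, the 3-window tail, and the 0.01 tolerance are all illustrative assumptions, as is the synthetic data. It also assumes the holdout's interactions are kept out of the training pipeline.

```python
from collections import defaultdict
from statistics import mean


def lift_by_window(records, window_days=7):
    """records: iterable of (day, group, value), group in {"holdout", "treatment"}.

    Returns [(window_index, relative_lift), ...] sorted by window.
    """
    buckets = defaultdict(lambda: {"holdout": [], "treatment": []})
    for day, group, value in records:
        buckets[day // window_days][group].append(value)

    lifts = []
    for window in sorted(buckets):
        holdout = buckets[window]["holdout"]
        treatment = buckets[window]["treatment"]
        if not holdout or not treatment:
            continue  # skip windows where one arm has no data
        base = mean(holdout)
        if base == 0:
            continue  # relative lift is undefined for a zero baseline
        lifts.append((window, (mean(treatment) - base) / base))
    return lifts


def has_converged(lifts, last_n=3, tolerance=0.01):
    """True if the lifts of the last `last_n` windows span at most `tolerance`."""
    if len(lifts) < last_n:
        return False
    tail = [lift for _, lift in lifts[-last_n:]]
    return max(tail) - min(tail) <= tolerance


# Synthetic data: a weak cold-start week followed by a plateau.
weekly_treatment = [0.90, 1.05, 1.10, 1.10, 1.10]
records = []
for week, treated_value in enumerate(weekly_treatment):
    for offset in range(7):
        day = week * 7 + offset
        records.append((day, "holdout", 1.0))
        records.append((day, "treatment", treated_value))

lifts = lift_by_window(records)
print(lifts)
print(has_converged(lifts))
```

In this synthetic series, a reading taken in the first week would show negative lift, while the later windows settle at a positive plateau. Waiting for the tail to stabilize before reporting a result avoids both readings of the onboarding phase.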
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:32:40.863889+00:00— report_created — created