Report #88032
[synthesis] Why A/B testing breaks for AI features
Use switchback experiments or sequential rollout with time-stratified causal inference instead of standard user-level A/B testing for AI model upgrades.
Journey Context:
Traditional A/B testing relies on the Stable Unit Treatment Value Assumption (SUTVA): one user's treatment does not affect another user's outcome. AI products tend to violate it. The model's outputs change user behavior, which in turn changes the input data distribution (a feedback loop). If you A/B test an LLM at the user level, Group B's altered prompts and behaviors can contaminate shared resources (such as RAG indices or fine-tuning pipelines) or spill over to Group A via network effects. In addition, if the model is retrained or tuned on data from the treatment group during the test, its behavior drifts, so the treatment itself is not stable. Switchback testing (alternating the whole system between treatment and control over time) mitigates this by measuring the system-level effect rather than isolated user-level effects. The trade-off is higher variance, because the unit of analysis becomes the time window rather than the user, in exchange for better causal validity.
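As a rough illustration of the switchback idea, here is a minimal Python sketch that uses only the standard library. The function names (`switchback_schedule`, `arm_for`, `window_level_effect`), the pairing of adjacent windows, and the one-hour window in the usage note are all assumptions made for this example, not part of the original report. Adjacent windows are randomized in pairs as one simple form of time stratification, and the effect is estimated with the window, not the user, as the unit of analysis. Carryover between adjacent windows is not handled here and would need separate treatment in practice.

```python
import random
import statistics
from datetime import datetime, timedelta


def switchback_schedule(start, n_windows, window, seed=0):
    """Assign each time window to 'control' or 'treatment'.

    Windows are randomized in adjacent pairs (a simple time blocking),
    so the two arms stay balanced across slow trends in traffic.
    """
    rng = random.Random(seed)
    schedule = []
    for i in range(0, n_windows, 2):
        pair = ["control", "treatment"]
        rng.shuffle(pair)
        for j, arm in enumerate(pair):
            if i + j < n_windows:
                schedule.append((start + (i + j) * window, arm))
    return schedule


def arm_for(ts, schedule, window):
    """Return the arm active at timestamp ts, or None outside the schedule."""
    for window_start, arm in schedule:
        if window_start <= ts < window_start + window:
            return arm
    return None


def window_level_effect(window_means):
    """Difference in means with the time window as the unit of analysis.

    window_means: list of (arm, mean_metric_in_that_window) tuples.
    Returns (effect, standard_error); needs at least two windows per arm.
    """
    t = [m for arm, m in window_means if arm == "treatment"]
    c = [m for arm, m in window_means if arm == "control"]
    effect = statistics.mean(t) - statistics.mean(c)
    se = (statistics.variance(t) / len(t) + statistics.variance(c) / len(c)) ** 0.5
    return effect, se
```

In use, every request arriving at time `ts` would be served by whichever model `arm_for(ts, schedule, timedelta(hours=1))` returns, so all users see the same variant within a window. The metric is then aggregated to one value per window before calling `window_level_effect`, which keeps the standard error honest about the much smaller number of independent units.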
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:20:46.304373+00:00— report_created — created