Report #59326
[synthesis] Why do AI feature A/B tests show significant effects in the experiment, only for the effect to vanish or reverse in production?
Use time-based (switchback) experiments or isolated model deployments instead of user-level randomization for AI features. When user-level randomization is unavoidable, explicitly account for spillover and contamination in variance estimates and power calculations. Never interpret a user-level A/B test on an AI feature at face value.
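A minimal sketch of what a switchback assignment could look like, assuming hourly windows and a hash-based coin flip per window. The names `EPOCH`, `WINDOW`, `SALT`, `arm_at`, and `window_level_effect` are hypothetical and not taken from any specific experimentation platform. The point is that every request in the same time window gets the same arm, and that the analysis compares window-level means, because the window (not the user) is the randomized unit.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from statistics import mean

# All three constants are illustrative assumptions.
EPOCH = datetime(2024, 1, 1, tzinfo=timezone.utc)  # hypothetical experiment start
WINDOW = timedelta(hours=1)                         # hypothetical switchback window
SALT = "switchback-demo"                            # hypothetical experiment key


def window_index(ts: datetime) -> int:
    """Number of whole windows elapsed between EPOCH and ts."""
    return (ts - EPOCH) // WINDOW


def arm_for_window(index: int) -> str:
    """Deterministic pseudo-random arm for a window, derived from a salted hash."""
    digest = hashlib.sha256(f"{SALT}:{index}".encode()).digest()
    return "treatment" if digest[0] % 2 == 0 else "control"


def arm_at(ts: datetime) -> str:
    """Arm that ALL traffic receives at time ts."""
    return arm_for_window(window_index(ts))


def window_level_effect(events) -> float:
    """Difference of window-level means (treatment minus control).

    events: iterable of (timestamp, metric_value) pairs.
    Each window is first collapsed to one mean, then windows are compared.
    """
    per_window = {}
    for ts, value in events:
        per_window.setdefault(window_index(ts), []).append(value)

    arm_means = {"treatment": [], "control": []}
    for index, values in per_window.items():
        arm_means[arm_for_window(index)].append(mean(values))

    if not arm_means["treatment"] or not arm_means["control"]:
        raise ValueError("need at least one window in each arm")
    return mean(arm_means["treatment"]) - mean(arm_means["control"])
```

In practice the window would need to be longer than any carryover (model updates, cache warm-up) from the previous window, and uncertainty should be estimated across windows rather than across individual requests.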
Journey Context:
Standard A/B testing assumes SUTVA (Stable Unit Treatment Value Assumption): one user's treatment does not affect another user's outcome. AI features violate this in three ways at once: (a) the model learns from all users, so treatment-group behavior contaminates the shared model; (b) AI features alter the query distribution, which affects shared infrastructure such as caches and ranking systems used by both groups; and (c) users influence each other's AI interactions (network effects). Teams run a standard A/B test, see a significant effect, ship the feature, and then find the effect gone, because treatment and control were never isolated from each other during the experiment and the measured difference does not carry over once every user gets the feature. The effect they measured was partly an artifact of the experimental design itself.
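The gap between the experimental estimate and the production effect can be illustrated with a toy linear interference model. This is a sketch under made-up assumptions, not a measurement: the function names and every coefficient (`base`, `direct`, `spill_on_control`, `spill_on_treated`) are hypothetical, chosen so that a higher treated share `p` hurts control users a bit more than treated users (for example through shared cache eviction).

```python
def expected_outcome(treated: bool, p: float,
                     base: float = 10.0,
                     direct: float = 1.0,
                     spill_on_control: float = -0.8,
                     spill_on_treated: float = -0.6) -> float:
    """Expected metric for one user when a share p of all users is treated.

    All coefficients are illustrative assumptions. Spillover scales
    linearly with p, which is itself a simplification.
    """
    if treated:
        return base + direct + spill_on_treated * p
    return base + spill_on_control * p


def ab_estimate(p: float = 0.5, **kw) -> float:
    """What a user-level A/B test measures: treated minus control,
    both observed while a share p of users is treated."""
    return expected_outcome(True, p, **kw) - expected_outcome(False, p, **kw)


def launch_effect(**kw) -> float:
    """What production delivers: everyone treated versus no one treated."""
    return expected_outcome(True, 1.0, **kw) - expected_outcome(False, 0.0, **kw)


if __name__ == "__main__":
    print("A/B estimate at 50/50 split:", round(ab_estimate(), 3))
    print("Effect of full launch:      ", round(launch_effect(), 3))
    # A stronger penalty on treated users flips the sign of the launch effect
    # while the A/B estimate stays positive.
    print("Launch effect, heavier spillover:",
          round(launch_effect(spill_on_treated=-1.2), 3))
```

With these invented numbers the 50/50 test reports a lift of roughly 1.1 while the full launch delivers roughly 0.4, and a heavier spillover on treated users turns the launch effect negative even though the A/B estimate remains positive.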
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:04:18.114272+00:00— report_created — created