Report #42804
[synthesis] Why A/B tests show significant results for AI features that don't hold at scale
For A/B tests of AI features, extend the observation window to at least 4-6 weeks (vs. 1-2 weeks for traditional features) and track the treatment effect over time rather than as a single pooled number. If the effect size shrinks week over week, you are likely seeing user-model co-adaptation, not a stable improvement. Segment by user tenure, because early adopters adapt differently from late adopters. Explicitly test for SUTVA violations by checking whether the model served to the treatment group has diverged from the control group's model because it was fine-tuned on treatment-group interactions.
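As a rough illustration of tracking the effect over time, the sketch below computes a per-week difference in means between arms and fits a least-squares slope to those weekly effects. The row layout `(week, arm, metric)`, the function names, and the sample numbers are all assumptions made up for this example, not part of the report. A clearly negative slope is a hint of decay, not proof; a real analysis would also want confidence intervals per week.

```python
from collections import defaultdict
from statistics import mean


def weekly_effects(rows):
    """Return [(week, treatment_mean - control_mean), ...] sorted by week.

    rows: iterable of (week, arm, metric), arm is "control" or "treatment".
    Weeks missing either arm are skipped.
    """
    buckets = defaultdict(lambda: {"control": [], "treatment": []})
    for week, arm, metric in rows:
        buckets[week][arm].append(metric)
    effects = []
    for week in sorted(buckets):
        arms = buckets[week]
        if arms["control"] and arms["treatment"]:
            effects.append((week, mean(arms["treatment"]) - mean(arms["control"])))
    return effects


def trend_slope(effects):
    """Ordinary least-squares slope of weekly effect against week number."""
    if len(effects) < 2:
        raise ValueError("need effects for at least two weeks")
    xs = [week for week, _ in effects]
    ys = [effect for _, effect in effects]
    x_bar, y_bar = mean(xs), mean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        raise ValueError("need at least two distinct weeks")
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / denom


# Illustrative, made-up data: control stays flat, treatment lift shrinks.
SAMPLE_ROWS = []
for week, treated_value in [(1, 1.4), (2, 1.3), (3, 1.2), (4, 1.1)]:
    SAMPLE_ROWS += [(week, "control", 1.0), (week, "control", 1.0)]
    SAMPLE_ROWS += [(week, "treatment", treated_value), (week, "treatment", treated_value)]

effects = weekly_effects(SAMPLE_ROWS)
slope = trend_slope(effects)
print(effects)
print("slope per week:", slope)
```

Running the same two functions separately per tenure cohort (for example, users who joined before vs. after the feature launched) is one way to apply the tenure segmentation suggested above.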
Journey Context:
Traditional A/B testing assumes a stable treatment effect: the feature either works or it doesn't, and the effect is constant over time. AI features violate this assumption because (1) users learn the AI's patterns and change their behavior, (2) the model may be updated or fine-tuned on treatment-group interactions while the test is running, and (3) early adopters who opt into AI features are systematically different from later users. The result: you can get a statistically significant positive result in weeks 1-2 that decays to zero by week 4, because early users learned to prompt the AI effectively but that learning doesn't transfer to later users. Worse, if the model is being fine-tuned on the treatment group's interactions, you have created a feedback loop that makes the treatment group's model diverge from control. Each user's outcome then depends on what other users in the same arm did, which violates SUTVA (the Stable Unit Treatment Value Assumption) that standard A/B test analysis relies on.
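One hedged way to check for the model divergence described above is to replay a fixed probe set through the model served to each arm and measure how often the outputs differ. The sketch below assumes deterministic decoding and exact-match comparison, which is crude for free-text output; `control_model`, `treatment_model`, and `PROBES` are hypothetical stand-ins, not a real serving API.

```python
def disagreement_rate(control_model, treatment_model, probes):
    """Fraction of fixed probe inputs on which the two arms' models differ."""
    if not probes:
        raise ValueError("probes must be non-empty")
    differing = sum(1 for p in probes if control_model(p) != treatment_model(p))
    return differing / len(probes)


# Hypothetical stand-ins for the models served to each arm.
def control_model(prompt: str) -> str:
    return prompt.lower()


def treatment_model(prompt: str) -> str:
    # Pretend fine-tuning changed behavior on prompts mentioning refunds.
    return prompt.upper() if "refund" in prompt.lower() else prompt.lower()


PROBES = ["Reset my password", "Refund my order", "Track shipment", "refund status"]

rate = disagreement_rate(control_model, treatment_model, PROBES)
print("disagreement rate:", rate)
```

Recording this rate at the start of the test and again each week gives a simple drift signal: if it rises while the control model is frozen, the treatment arm is no longer testing the same intervention it started with.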
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:18:49.180212+00:00— report_created — created