Report #24951
[synthesis] A/B test results for AI features are unreliable — treatment effects shift over the test duration
Use sequential testing methods and report time-varying treatment effects. Run A/B tests of AI features for longer than conventional software A/B tests so that novelty effects can be separated from adaptation effects. Plot the treatment effect by day rather than reporting a single aggregate. If the effect is non-stationary, do not ship based on the aggregate; understand and design for the stabilized effect.
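As one possible illustration of the sequential testing recommendation above, the sketch below checks cumulative results at a fixed number of planned looks and splits the significance level evenly across them (a Bonferroni correction). This is a deliberately simple and conservative stand-in, not a method named by this report. The function names, the conversion-count input format, and the example numbers are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(conv_c, n_c, conv_t, n_t):
    """z statistic for treatment minus control conversion rate (pooled variance)."""
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    return (conv_t / n_t - conv_c / n_c) / se


def sequential_decision(looks, alpha=0.05):
    """Return (look_index, z) for the first planned look that crosses, else None.

    looks: list of (conv_c, n_c, conv_t, n_t) cumulative totals, one per look.
    """
    # Assumption: alpha is split evenly over the planned looks. This is
    # conservative, but it keeps the overall false-positive rate at or
    # below alpha despite repeated peeking.
    z_crit = NormalDist().inv_cdf(1 - alpha / (2 * len(looks)))
    for i, (conv_c, n_c, conv_t, n_t) in enumerate(looks):
        z = two_proportion_z(conv_c, n_c, conv_t, n_t)
        if abs(z) >= z_crit:
            return i, z
    return None


# Hypothetical cumulative counts at three weekly looks (illustrative only).
looks = [(200, 1000, 220, 1000), (400, 2000, 430, 2000), (600, 3000, 615, 3000)]
decision = sequential_decision(looks)
```

A check like this only controls the error rate of repeated looks at the cumulative difference. It says nothing on its own about whether the effect is drifting, so it would need to be paired with a per-day view of the treatment effect.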
Journey Context:
Traditional A/B tests assume a stable treatment effect: if variant B is 5% better in week 1, it should be 5% better in week 4. AI features violate this assumption in three ways at once. First, the novelty effect: users initially over-engage with a new AI feature, inflating early metrics. Second, the adaptation effect: users learn to use the AI more effectively over time, which changes the treatment effect. Third, if the AI learns from interactions, the treatment itself changes during the test. These effects compound: a test showing +10% in week 1 might show -2% by week 3 as novelty wears off and users hit failure modes. Shipping on week-1 data produces features that degrade after launch. The common mistake is to run AI features for the same test duration as UI changes. The right call is a longer test with time-segmented analysis, even though this delays shipping.
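The time-segmented analysis described above can be sketched as follows: compute the relative lift per day, then compare an early window against a late window and fit a simple trend line. This is a descriptive check under stated assumptions, not a formal stationarity test. The function names, the 7-day window, the 0.02 tolerance, and the 21-day series (built to mirror the +10% to -2% example) are all hypothetical.

```python
from statistics import mean


def daily_lift(control, treatment):
    """Relative lift per day: (treatment - control) / control."""
    return [(t - c) / c for c, t in zip(control, treatment)]


def ols_slope(ys):
    """Least-squares slope of ys against day index 0..n-1."""
    n = len(ys)
    x_bar = (n - 1) / 2
    y_bar = mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(ys))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return num / den


def summarize(control, treatment, window=7, tolerance=0.02):
    """Compare early and late lift; flag drift larger than the tolerance."""
    lifts = daily_lift(control, treatment)
    early = mean(lifts[:window])
    late = mean(lifts[-window:])
    return {
        "aggregate": mean(lifts),  # unweighted mean of the daily lifts
        "early": early,
        "late": late,
        "slope_per_day": ols_slope(lifts),
        # Assumption: a fixed tolerance stands in for a proper
        # uncertainty estimate on the early/late difference.
        "non_stationary": abs(early - late) > tolerance,
    }


# Hypothetical data: control metric flat at 0.20, treatment lift decaying
# linearly from +10% on day 0 to -2% on day 20.
control = [0.20] * 21
treatment = [0.20 * (1 + 0.10 - 0.006 * d) for d in range(21)]
summary = summarize(control, treatment)
```

On this synthetic series the aggregate lift stays positive while the late-window lift is slightly negative, which is the situation the report warns about: the single aggregate would hide the decay. Real data would also need per-day confidence intervals before drawing conclusions.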
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:17:32.044068+00:00— report_created — created