Report #49787
[synthesis] Why does A/B testing give misleading results for AI features?
Use time-based or geo-isolated A/B testing for AI features rather than standard user-based randomization. Extend experiment windows to account for trust formation and learning effects. Measure trust proxies (return rate, depth of verification behavior, re-prompt frequency) alongside task completion. Never A/B test AI features that learn from user interactions without isolating the learning loop.
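A minimal sketch of geo-isolated assignment, assuming each user carries a coarse region identifier. The names `assign_arm`, `cluster_id`, and `experiment` are hypothetical and not part of this report; the idea is only that a whole region, not an individual user, is the unit of randomization, so users who talk to each other tend to land in the same arm.

```python
import hashlib


def assign_arm(cluster_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign an entire cluster (e.g. a region) to one arm.

    Hypothetical helper: every user in the same cluster gets the same arm,
    which limits cross-arm contamination between neighbouring users.
    """
    if not 0.0 <= treatment_share <= 1.0:
        raise ValueError("treatment_share must be between 0 and 1")
    # Salt the hash with the experiment name so different experiments
    # do not reuse the same split of clusters.
    digest = hashlib.sha256(f"{experiment}:{cluster_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")      # integer in [0, 2**64)
    threshold = int(treatment_share * 2**64)        # integer cut-off
    return "treatment" if bucket < threshold else "control"


def arm_for_user(user_region: str, experiment: str) -> str:
    """Look up a user's arm from their region, never from their user id."""
    return assign_arm(user_region, experiment)
```

Because clusters are far fewer than users, any analysis on top of this would need to treat the cluster as the unit of inference (for example, cluster-robust standard errors), which reduces statistical power compared with user-level randomization.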
Journey Context:
Standard A/B testing assumes SUTVA (Stable Unit Treatment Value Assumption): one user's treatment doesn't affect another's outcome. AI features violate this in three simultaneous ways that no single source identifies: (1) Trust externalities: a bad AI experience in the treatment group generates word-of-mouth contamination that depresses control group perception. (2) Learning system divergence: when the AI learns from treatment group interactions, it improves for them but not for the control group, creating non-parallel trajectories that invalidate the comparison. (3) Verification tax asymmetry: treatment group users develop verification behaviors (double-checking AI outputs) that change their engagement patterns, making the 'time on task' metric uninterpretable. The synthesis: these three contamination vectors compound. Standard A/B testing doesn't just give noisy results for AI features; it gives confidently wrong results, because the violations are systematic rather than random.
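To illustrate why a systematic violation produces a biased rather than a merely noisy estimate, here is a toy sketch of the trust-externality case only. The linear spillover model and all numbers are assumptions made for illustration, not measurements from this report: the control group's mean outcome is shifted by `spillover * treated_share`, where a negative `spillover` stands for word-of-mouth that depresses control perception.

```python
def naive_estimate(baseline: float, true_effect: float,
                   spillover: float, treated_share: float) -> float:
    """Difference in means that a standard A/B readout would report.

    Assumed toy model (hypothetical, for illustration only):
      treated mean = baseline + true_effect
      control mean = baseline + spillover * treated_share
    """
    treated_mean = baseline + true_effect
    control_mean = baseline + spillover * treated_share
    return treated_mean - control_mean


def bias(true_effect: float, spillover: float, treated_share: float) -> float:
    """Gap between the naive readout and the true effect.

    Under the toy model this equals -spillover * treated_share and does
    not depend on sample size, so collecting more data cannot remove it.
    """
    return naive_estimate(0.0, true_effect, spillover, treated_share) - true_effect


# Example: the AI feature truly lowers task success by 5 points, and bad
# experiences leak into a control group when half of users are treated.
reported = naive_estimate(baseline=0.60, true_effect=-0.05,
                          spillover=-0.04, treated_share=0.5)
```

In this example the readout is about -0.03 while the true effect is -0.05, so the experiment understates the harm by roughly two points however long it runs. The other two vectors (learning divergence and verification behavior) are not modeled here.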
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T14:03:15.887548+00:00— report_created — created