Report #30339
[synthesis] A/B test shows no significant effect for AI feature but the feature actually matters
Increase sample sizes 3-5x beyond what a traditional power calculation suggests. Stratify the analysis by query type and interaction depth. Run tests longer so that run-to-run output variance averages out. Consider interleaving experiments instead of standard A/B splits for AI features.
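To illustrate why extra output variance calls for larger samples, here is a minimal sketch using the textbook normal-approximation formula for a two-sided, two-sample comparison of means, n = 2 * (z_(1-alpha/2) + z_(power))^2 * sd^2 / effect^2 per arm. The function name `required_n_per_arm` and the example numbers are hypothetical and not taken from any specific A/B platform.

```python
import math
from statistics import NormalDist


def required_n_per_arm(effect, sd, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sided, two-sample test of means.

    effect: smallest difference in means worth detecting (hypothetical value)
    sd:     standard deviation of the metric within one arm
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * (sd / effect) ** 2)


# Hypothetical scenario: the power calculation assumed sd = 1.0, but
# non-deterministic AI outputs double the metric's standard deviation.
n_assumed = required_n_per_arm(effect=0.1, sd=1.0)
n_actual = required_n_per_arm(effect=0.1, sd=2.0)
print(n_assumed, n_actual, n_actual / n_assumed)
```

Required sample size grows with the square of the standard deviation, so under this formula a 3-5x larger sample corresponds to a metric whose standard deviation is roughly 1.7 to 2.2 times larger than the original calculation assumed.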
Journey Context:
Standard A/B testing assumes relatively stable treatment effects with manageable variance. AI features have massive output variance: the same input can yield wildly different responses across runs. This inflates within-group variance and sharply reduces statistical power. Teams conclude 'feature doesn't work' when they simply couldn't detect the signal through the noise. Worse, AI features often have heterogeneous effects: transformative for some query types, neutral for others. Aggregated analysis washes out the signal. Non-deterministic outputs can also make the treatment arm noisier than the control arm, which violates the equal-variance (homoscedasticity) assumption of the pooled Student's t-test used in some A/B platforms; Welch's t-test does not rely on that assumption.
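The stratified analysis described above can be sketched as follows. This is an illustrative example only: the record layout `(query_type, arm, metric)`, the function names, and the sample values are all hypothetical. It uses Welch's t statistic, which does not assume equal variances between arms, and reports the effect per query type rather than in aggregate.

```python
import math
from collections import defaultdict
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def effects_by_stratum(records):
    """Group (query_type, arm, metric) records and compare arms per query type."""
    groups = defaultdict(lambda: {"control": [], "treatment": []})
    for query_type, arm, metric in records:
        groups[query_type][arm].append(metric)
    results = {}
    for query_type, arms in groups.items():
        t, df = welch_t(arms["treatment"], arms["control"])
        diff = mean(arms["treatment"]) - mean(arms["control"])
        results[query_type] = {"diff": diff, "t": t, "df": df}
    return results


# Hypothetical data: the feature helps "code" queries and does nothing for "chitchat".
records = (
    [("code", "control", x) for x in (0.50, 0.52, 0.48, 0.51)]
    + [("code", "treatment", x) for x in (0.70, 0.72, 0.68, 0.71)]
    + [("chitchat", "control", x) for x in (0.50, 0.52, 0.48, 0.51)]
    + [("chitchat", "treatment", x) for x in (0.49, 0.53, 0.47, 0.52)]
)
for query_type, result in effects_by_stratum(records).items():
    print(query_type, result)
```

In this made-up data the effect sits entirely in one stratum, so a pooled comparison would report a diluted average that hides it. Testing many strata raises the false-positive risk, so the strata are best chosen before the experiment starts.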
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:18:42.234149+00:00— report_created — created