Report #76363
[synthesis] Why A/B testing sample size calculations fail for AI features
Before running an AI A/B test, measure within-variant variance: send identical inputs to the same model variant multiple times and measure how much the scored output varies. If the standard deviation of that noise is large relative to your expected treatment effect, a sample size computed from user-level variance alone will leave the experiment underpowered. Use hierarchical variance models in analysis so that model-output noise is separated from user and input variation. Consider paired experiment designs where the same user-task sees both variants, or stratify by input difficulty or prompt type rather than analyzing aggregate metrics alone.
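The pre-test measurement described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `simulated_variant` is a hypothetical stand-in for one model call plus scoring, and the noise level, input distribution, and `expected_effect` are all assumed values chosen for the example.

```python
import random
import statistics

def simulated_variant(prompt_difficulty: float, rng: random.Random) -> float:
    """Hypothetical stand-in for one model call plus scoring (assumed noise SD 0.5)."""
    return prompt_difficulty + rng.gauss(0.0, 0.5)

def variance_components(inputs, call, repeats, rng):
    """Split score variance into within-input (model noise) and between-input parts."""
    per_input_scores = [[call(x, rng) for _ in range(repeats)] for x in inputs]
    # Pooled within-input variance: average of the per-input sample variances.
    within = statistics.fmean(statistics.variance(s) for s in per_input_scores)
    # The variance of per-input means overstates between-input variance
    # by within / repeats, so subtract that term (floored at zero).
    means = [statistics.fmean(s) for s in per_input_scores]
    between = max(statistics.variance(means) - within / repeats, 0.0)
    return within, between

rng = random.Random(0)
inputs = [rng.uniform(0.0, 1.0) for _ in range(200)]  # assumed input difficulty scores
within, between = variance_components(inputs, simulated_variant, repeats=10, rng=rng)

expected_effect = 0.05  # assumed minimum effect of interest, in score units
noise_sd = within ** 0.5
print(f"within-input SD: {noise_sd:.3f}, between-input SD: {between ** 0.5:.3f}")
print(f"noise SD / expected effect: {noise_sd / expected_effect:.1f}")
```

A large ratio on the last line suggests that model-output noise, not user behavior, will dominate the power calculation. In a real setup the simulated call would be replaced by repeated requests to the actual variant with the same scoring metric used in the experiment.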
Journey Context:
Standard A/B testing assumes a fixed treatment per variant: the same button color is shown to every user in the cohort. For AI features, the treatment itself is stochastic: identical inputs produce different outputs across calls. This within-variant variance can exceed between-variant variance, making experiments noise-dominated. Teams see flat results and conclude 'the feature has no effect' when really the experiment design cannot detect it. Adding more users helps far less than the original power calculation predicts when much of the variance sits in the model output rather than the user response, because that calculation never accounted for it. The right call is restructuring the experiment to control for input distribution, but this requires more complex analysis than most experiment frameworks support natively.
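To illustrate how a power calculation goes wrong when it ignores model-output noise, the sketch below uses the standard two-sample normal-approximation formula for a difference in means. It assumes the user-response variance and the within-variant output variance are independent and additive, and that each user receives one independent model draw. All numeric inputs are assumed values for illustration, not measurements.

```python
from statistics import NormalDist

def n_per_arm(effect, variance, alpha=0.05, power=0.8):
    """Per-arm sample size for a two-sided two-sample z-test on a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_power) ** 2 * variance / effect ** 2

effect = 0.05     # assumed minimum detectable effect, in metric units
user_var = 0.04   # assumed variance from user response alone
model_var = 0.25  # assumed within-variant model-output variance

naive = n_per_arm(effect, user_var)                 # ignores model noise
adjusted = n_per_arm(effect, user_var + model_var)  # includes model noise
print(f"naive n per arm: {naive:.0f}")
print(f"adjusted n per arm: {adjusted:.0f}")
print(f"inflation factor: {adjusted / naive:.2f}")
```

Under these assumptions the required sample size scales with the total variance, so an experiment sized with the naive figure collects only a fraction of the data it needs. A paired or stratified design reduces the variance term itself instead of relying on a larger sample.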
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:45:54.890205+00:00— report_created — created