Report #60642
[synthesis] A/B Testing Breaks for AI Features: The Variance Inflation Problem
Use temperature=0 for evaluation branches, increase sample sizes 3-10x over traditional power calculations, employ paired designs where the same input hits both branches, and control for model version + prompt version as covariates. Report confidence intervals, not just point estimates.
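A minimal sketch of the paired-design analysis recommended above, assuming each input has already been scored once under each branch. The function name `paired_diff_ci`, the score lists, and the normal-approximation interval are illustrative assumptions, not part of the original report; for small samples a t-based interval would be more appropriate.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def paired_diff_ci(scores_a, scores_b, confidence=0.95):
    """Mean per-input difference (B - A) with a normal-approximation CI.

    Hypothetical helper: scores_a[i] and scores_b[i] must be quality
    scores for the SAME input i under branch A and branch B.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired design needs one score per input per branch")
    if len(scores_a) < 2:
        raise ValueError("need at least two paired inputs")

    # Differencing within each input removes between-input variance.
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    centre = mean(diffs)
    std_err = stdev(diffs) / sqrt(len(diffs))

    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return centre, centre - z * std_err, centre + z * std_err
```

Calling `paired_diff_ci(scores_a, scores_b)` returns the point estimate together with the lower and upper bounds, so the interval is reported alongside the estimate rather than the estimate alone.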
Journey Context:
Classical A/B testing assumes that the same input to the same branch yields the same output, so variance comes only from user heterogeneity. LLM outputs are stochastic: the same prompt can yield different completions across calls. Power analysis shows why this matters: the sample size needed to detect a given effect size grows in proportion to the total outcome variance, so output variance added on top of user variance inflates it. The synthesis: most teams run AI A/B tests with traditional sample sizes and get inconclusive or misleading results. The non-determinism isn't noise you can average away; it's signal about the distribution of model behavior. You need paired designs and variance-aware power calculations, not just more users.
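The variance inflation can be sketched with the standard two-sample z-test sample-size formula. This is an illustration under stated assumptions, not the report's own method: the function name `n_per_arm` and its parameters are hypothetical, and user-level and model-level variance are assumed to be independent so that they add.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(delta, sigma_user, sigma_model=0.0, alpha=0.05, power=0.8):
    """Per-arm sample size to detect a mean difference `delta`.

    Assumption: total outcome variance is the sum of user heterogeneity
    (sigma_user**2) and model output stochasticity (sigma_model**2).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    total_var = sigma_user ** 2 + sigma_model ** 2
    return ceil(2 * (z_alpha + z_power) ** 2 * total_var / delta ** 2)


# Classical calculation: user variance only.
classical = n_per_arm(delta=0.1, sigma_user=1.0)

# Variance-aware calculation: the model adds its own output variance.
variance_aware = n_per_arm(delta=0.1, sigma_user=1.0, sigma_model=2.0)

inflation = variance_aware / classical
```

Under these made-up inputs the required sample size grows by the ratio of total variance to user variance, which is the mechanism behind the 3-10x guidance; the actual multiplier depends on how much output variance a given model and prompt produce.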
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:16:36.127415+00:00— report_created — created