Report #95327
[synthesis] Why do A/B tests on AI features show insignificant results despite large sample sizes?
Calculate required sample sizes from the combined variance of the outcome metric and the model's output, not from the outcome variance alone. This typically requires 3-10x the sample size of a deterministic feature test. Where possible, pre-generate and cache AI responses for treatment arms to collapse within-group variance. Prefer within-subjects (paired) designs over between-subjects designs for AI feature tests.
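A minimal sketch of that calculation, assuming a two-sided z-test on a difference in means, equal allocation across arms, and model noise that adds independently to the treatment arm only. The function name `required_n_per_arm` and its parameters are illustrative and do not come from any particular experimentation framework.

```python
import math
from statistics import NormalDist


def required_n_per_arm(effect, user_var, model_var=0.0, alpha=0.05, power=0.8):
    """Per-arm sample size for detecting a difference in means of size `effect`.

    user_var:  variance of the metric across users (present in both arms).
    model_var: extra variance from nondeterministic model output
               (assumed to affect the treatment arm only).
    """
    if effect == 0:
        raise ValueError("effect must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    control_var = user_var
    treatment_var = user_var + model_var
    n = (z_alpha + z_beta) ** 2 * (control_var + treatment_var) / effect ** 2
    return math.ceil(n)


# Deterministic feature versus an AI feature with noisy output.
n_deterministic = required_n_per_arm(effect=0.1, user_var=1.0)
n_ai_feature = required_n_per_arm(effect=0.1, user_var=1.0, model_var=4.0)
print(n_deterministic, n_ai_feature, n_ai_feature / n_deterministic)
```

Under these assumptions, a model-noise variance four times the user-level variance roughly triples the per-arm requirement. The real ratio depends on the metric and the model, so treat the inputs as placeholders until pilot data is available.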
Journey Context:
Standard A/B testing assumes the treatment is deterministic: every user in the treatment group receives the same experience. With AI features, User A might get a brilliant response while User B gets a hallucination, both in the same treatment arm. This within-group variance inflates total variance and reduces statistical power. Most teams use standard sample size calculators that assume deterministic treatments, which leads to underpowered experiments that conclude 'no effect' when the effect is real but swamped by model noise. Combining controlled experiment methodology with LLM output variance analysis reveals a variance inflation factor specific to AI features that standard A/B testing frameworks do not account for. The fix is either to reduce within-group variance (caching, prompt standardization, temperature control) or to increase sample sizes by the inflation factor. Either way, you must first estimate that factor, which requires pilot data most teams don't collect.
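One way to estimate the inflation factor from pilot data is a one-way variance decomposition. The sketch below assumes a numeric metric, a balanced pilot in which each input (user or prompt) is regenerated the same number of times, one generation per user in the real experiment, and a control arm with no model noise. The names `variance_components` and `inflation_factor`, and the pilot data, are hypothetical.

```python
from statistics import mean, variance


def variance_components(pilot):
    """Split pilot variance into between-input and within-input parts.

    pilot: dict mapping an input id to a list of metric scores, one score
    per repeated generation for that same input (balanced design).
    Returns (between_var, within_var); within_var is the model noise.
    """
    groups = list(pilot.values())
    if len(groups) < 2:
        raise ValueError("need at least two inputs")
    k = len(groups[0])
    if k < 2 or any(len(g) != k for g in groups):
        raise ValueError("need the same number (>= 2) of generations per input")
    # Average variance across repeated generations of the same input.
    within = mean(variance(g) for g in groups)
    # Variance of per-input means also contains within / k of noise; remove it.
    group_means = [mean(g) for g in groups]
    between = max(variance(group_means) - within / k, 0.0)
    return between, within


def inflation_factor(between, within):
    """Ratio of required per-arm sample size versus a deterministic treatment."""
    if between <= 0:
        raise ValueError("between-input variance must be positive")
    # Deterministic test: both arms have variance `between`.
    # AI test: control has `between`, treatment has `between + within`.
    return (2 * between + within) / (2 * between)


# Hypothetical pilot: three inputs, two regenerated responses each.
pilot = {"u1": [1.0, 3.0], "u2": [5.0, 7.0], "u3": [9.0, 11.0]}
between_var, within_var = variance_components(pilot)
vif = inflation_factor(between_var, within_var)
print(between_var, within_var, vif)
```

Multiplying a standard calculator's per-arm sample size by this factor gives a rough adjusted target. A pilot this small would give a very noisy estimate, so a real pilot needs many more inputs and generations.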
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:35:08.717981+00:00— report_created — created