Report #24782
[synthesis] A/B test for AI feature shows inconclusive results despite real effect existing
Inflate the minimum sample size from standard power calculations by 3-5x to account for AI output variance; use stratified evaluation that bins AI outputs into semantic categories before statistical testing; pre-register evaluation criteria that include semantic quality checks, not just click-through or engagement metrics
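A rough sketch of why a variance underestimate turns into an underpowered test, assuming a two-sided, two-sample comparison of means under the normal approximation. The function name `n_per_arm` and the example numbers are hypothetical; the 3-5x figure is this report's heuristic and is not derived here. Under this formula the required sample size grows with the square of the outcome standard deviation, so a 3-5x larger sample corresponds to 3-5x more outcome variance than the original calculation assumed.

```python
import math
from statistics import NormalDist


def n_per_arm(sigma: float, delta: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per arm to detect a mean difference `delta`.

    Normal-approximation formula for a two-sided two-sample test:
        n = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / delta^2
    `sigma` is the per-user outcome standard deviation (assumed equal in both arms).
    """
    if sigma <= 0 or delta <= 0:
        raise ValueError("sigma and delta must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)


if __name__ == "__main__":
    # Hypothetical numbers: same target effect, but the AI metric is twice as noisy.
    baseline = n_per_arm(sigma=1.0, delta=0.1)
    noisy = n_per_arm(sigma=2.0, delta=0.1)
    print(f"deterministic-style metric: {baseline} users per arm")
    print(f"2x noisier AI metric:       {noisy} users per arm")
    print(f"inflation factor:           {noisy / baseline:.2f}x")
```

Doubling the standard deviation quadruples the variance and therefore roughly quadruples the required sample, which is in the range the recommendation above points to.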
Journey Context:
AI outputs have much higher variance than deterministic software outputs. A button color test yields an essentially binary outcome per user (click or no click), while an AI response test has an effectively unbounded space of possible outputs per user. Power calculations that reuse variance assumptions from deterministic UI experiments therefore badly underestimate the sample size an AI feature needs. Teams run A/B tests for months, get inconclusive results, and conclude the feature has no effect, when the test was simply underpowered for the variance of AI outputs. AI features also suffer from interference effects: the treatment group generates different data that can feed into shared model training, contaminating the control. The fix isn't just more users. It is tighter evaluation criteria that collapse the unbounded AI output space into meaningful semantic bins (correct/helpful, correct/unhelpful, incorrect/harmful, incorrect/benign) before running statistics. Testing on these bins reduces variance and makes effects detectable.
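The binning step described above can be sketched as follows. This is a minimal illustration, assuming each AI response has already been labeled with one of the four bins by some pre-registered rubric (the labeling itself is out of scope). It uses a Pearson chi-square test of homogeneity across the two arms, which is one reasonable choice and not necessarily the test the report's author used. The names `bin_counts` and `chi_square_statistic` and the counts in the usage example are hypothetical.

```python
from collections import Counter

# The four semantic bins named in this report.
BINS = ("correct/helpful", "correct/unhelpful", "incorrect/harmful", "incorrect/benign")

# Chi-square critical value for df = (2 arms - 1) * (4 bins - 1) = 3 at alpha = 0.05.
CHI2_CRIT_DF3_ALPHA05 = 7.815


def bin_counts(labels):
    """Collapse a list of per-response bin labels into counts ordered like BINS."""
    counts = Counter(labels)
    unknown = set(counts) - set(BINS)
    if unknown:
        raise ValueError(f"labels outside the pre-registered bins: {sorted(unknown)}")
    return [counts[b] for b in BINS]


def chi_square_statistic(control, treatment):
    """Pearson chi-square statistic for a 2 x len(BINS) table of bin counts."""
    n_control, n_treatment = sum(control), sum(treatment)
    total = n_control + n_treatment
    stat = 0.0
    for c, t in zip(control, treatment):
        column_total = c + t
        if column_total == 0:
            # Bin empty in both arms: contributes nothing. The df=3 critical
            # value is then conservative, since the true df is smaller.
            continue
        for observed, row_total in ((c, n_control), (t, n_treatment)):
            expected = row_total * column_total / total
            stat += (observed - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    # Hypothetical labeled responses, 1000 per arm.
    control = bin_counts(
        ["correct/helpful"] * 400 + ["correct/unhelpful"] * 300
        + ["incorrect/harmful"] * 50 + ["incorrect/benign"] * 250
    )
    treatment = bin_counts(
        ["correct/helpful"] * 460 + ["correct/unhelpful"] * 260
        + ["incorrect/harmful"] * 40 + ["incorrect/benign"] * 240
    )
    stat = chi_square_statistic(control, treatment)
    print(f"chi-square = {stat:.2f}, critical value = {CHI2_CRIT_DF3_ALPHA05}")
    print("distributions differ" if stat > CHI2_CRIT_DF3_ALPHA05 else "inconclusive")
```

Rejecting labels outside the fixed bin set is deliberate: it keeps the evaluation tied to the pre-registered categories, so new bins cannot be introduced after seeing the data.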
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:00:29.860823+00:00— report_created — created