Report #82779
[synthesis] A/B tests show no significant effect on AI features despite real improvements
Isolate model stochasticity from the treatment effect. When testing a UI/UX change, fix generation seeds per user session, or pre-generate model outputs into a static lookup table for the duration of the experiment, so that both arms see the same model behavior and the remaining variance comes from users. When testing a model-quality change, first run offline eval sets with enough repeated runs per prompt (n ≥ 30) to estimate within-group variance, then use that estimate to power the live experiment.
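A minimal sketch of the two controls described above, using only the Python standard library. All names here (`session_seed`, `assign_arm`, `PREGENERATED_OUTPUTS`, `served_output`) are hypothetical and chosen for illustration. It assumes the model API accepts an integer seed; whether a seed actually yields identical outputs depends on the provider and should be verified before relying on it.

```python
import hashlib

# Hypothetical helpers for illustration; none of these names come from a real library.


def session_seed(experiment_id: str, user_id: str, session_id: str) -> int:
    """Derive a stable 32-bit generation seed from the session identity."""
    key = f"{experiment_id}:{user_id}:{session_id}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")


def assign_arm(experiment_id: str, user_id: str,
               arms: tuple = ("control", "treatment")) -> str:
    """Deterministically bucket a user into an arm, independent of the seed."""
    key = f"{experiment_id}:arm:{user_id}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return arms[bucket % len(arms)]


# Alternative control: freeze outputs up front so every user who hits the
# same prompt sees identical model text. Entries are placeholders, not real
# model outputs.
PREGENERATED_OUTPUTS = {
    "prompt-001": "placeholder output for prompt-001",
    "prompt-002": "placeholder output for prompt-002",
}


def served_output(prompt_id: str) -> str:
    """Serve the frozen output; a KeyError flags a prompt outside the experiment."""
    return PREGENERATED_OUTPUTS[prompt_id]
```

The seed is derived without reference to the arm, so the assignment and the model output stay independent. The lookup table is the stronger control of the two, since it removes model randomness entirely instead of only making it reproducible.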
Journey Context:
Traditional A/B testing assumes within-group variance comes only from user heterogeneity. AI products add a second source of variance: model non-determinism. Because the standard error of the measured effect grows with total within-group variance, the stochastic variance of model outputs can swamp the treatment effect when you A/B test an AI feature change, yielding non-significant p-values even when the improvement is real. Teams commonly misread this as "the feature doesn't work" and kill it. The correct approach is to decompose the variance: control model randomness when testing UX changes, and use adequately powered offline evals when testing model changes. This report synthesizes classical power analysis from experiment design with the non-determinism of LLM APIs, a combination that no single statistics or ML source addresses.
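The variance decomposition can be sketched numerically. The example below assumes model noise is independent of user heterogeneity, so the two variances add, and it uses the standard normal-approximation sample size for a two-sided, two-sample comparison of means, n = 2(z_{1-α/2} + z_{power})² σ² / δ² per arm. The function names and all numbers are illustrative assumptions, not measurements from this report; the offline runs are truncated to four per prompt for brevity, where the report recommends at least 30.

```python
import math
from statistics import NormalDist, mean, variance


def model_variance(runs_by_prompt: dict) -> float:
    """Average within-prompt sample variance across repeated offline runs."""
    return mean(variance(scores) for scores in runs_by_prompt.values())


def required_n_per_arm(effect: float, sigma2: float,
                       alpha: float = 0.05, power: float = 0.8) -> int:
    """Normal-approximation sample size per arm for a two-sided test of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma2 / effect ** 2)


# Made-up offline eval scores: repeated runs of the same prompt.
offline_runs = {
    "prompt-a": [0.0, 2.0, 0.0, 2.0],
    "prompt-b": [1.0, 3.0, 1.0, 3.0],
}

sigma2_user = 1.0                            # assumed user-level variance
sigma2_model = model_variance(offline_runs)  # estimated from the repeated runs
sigma2_total = sigma2_user + sigma2_model    # assumes the two sources are independent

effect = 0.1  # assumed minimum detectable difference in the metric

n_user_only = required_n_per_arm(effect, sigma2_user)
n_with_model_noise = required_n_per_arm(effect, sigma2_total)

print(f"per-arm n, model noise controlled:   {n_user_only}")
print(f"per-arm n, model noise uncontrolled: {n_with_model_noise}")
```

An experiment sized for user variance alone is underpowered once model variance is added, which is one way a real improvement ends up with a non-significant p-value.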
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:32:16.380481+00:00— report_created — created