Report #76592
[synthesis] Why A/B testing fails for AI features
Use interleaving experiments instead of traditional A/B splits, and measure outcome variance at the session level rather than the user level to account for non-deterministic outputs.
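As a rough illustration of session-level measurement, the sketch below groups outcomes by session and computes the mean and sample variance over per-session averages. The function name `session_level_stats` and the `(session_id, outcome)` event shape are assumptions made for this example, not part of any particular analytics stack.

```python
from collections import defaultdict
from statistics import mean, variance


def session_level_stats(events):
    """Aggregate outcomes per session, then summarize across sessions.

    `events` is an iterable of (session_id, outcome) pairs with numeric
    outcomes (hypothetical shape, chosen for illustration).
    Returns (mean, sample variance) of the per-session mean outcomes.
    """
    by_session = defaultdict(list)
    for session_id, outcome in events:
        by_session[session_id].append(outcome)

    # One number per session, so a long session does not count more than a short one.
    session_means = [mean(values) for values in by_session.values()]
    if len(session_means) < 2:
        raise ValueError("need at least two sessions to estimate variance")
    return mean(session_means), variance(session_means)
```

Treating the session as the unit of analysis keeps repeated, differing outputs inside one session from being counted as independent observations.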
Journey Context:
Traditional A/B testing assumes that each arm delivers one well-defined treatment with a stable outcome distribution. AI features introduce high variance within a single treatment arm because the model's output changes with subtle differences in the prompt or the context window. The same arm can therefore deliver different effective treatments, which strains the "no hidden versions of treatment" part of the stable unit treatment value assumption (SUTVA). Interleaving (showing results from both model A and model B, in randomized positions, for the same query) reduces variance by holding user intent constant, since each user compares both models on the same request. This lets the experiment reach statistical significance with substantially less traffic.
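One common way to interleave is team-draft interleaving. The sketch below is a minimal version, assuming each model returns a ranked list of hashable result IDs for the same query; the names `team_draft_interleave` and `credit` are hypothetical helpers for this example, and clicks stand in for whatever preference signal the feature actually records.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two ranked lists into one, remembering which model supplied each item.

    Returns (interleaved, team), where team maps item -> "A" or "B".
    """
    rng = rng or random.Random()
    sources = {"A": ranking_a, "B": ranking_b}
    count = {"A": 0, "B": 0}
    interleaved, team = [], {}
    total = len(set(ranking_a) | set(ranking_b))

    while len(interleaved) < total:
        # The team with fewer picks goes first; ties are broken by a coin flip.
        if count["A"] != count["B"]:
            order = ["A", "B"] if count["A"] < count["B"] else ["B", "A"]
        else:
            order = rng.sample(["A", "B"], 2)
        for name in order:
            # Each team contributes its highest-ranked item not yet shown.
            pick = next((x for x in sources[name] if x not in team), None)
            if pick is not None:
                interleaved.append(pick)
                team[pick] = name
                count[name] += 1
                break
    return interleaved, team


def credit(team, clicked_items):
    """Count clicks per model; the model with more credited clicks wins the impression."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team:
            wins[team[item]] += 1
    return wins
```

Per-impression wins can then be aggregated across sessions and compared with a paired test such as a sign test. Because both models are judged on the same query by the same user, differences in user intent cancel out instead of adding noise.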
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:09:02.782914+00:00— report_created — created