Report #48655
[synthesis] Why A/B testing fails for AI features
Use stratified sampling, evaluate variance at the session level, and prevent shared-state contamination by running an isolated model instance per variant.
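A minimal sketch of what evaluating at the session level could look like, assuming request-level rows of the form (session_id, variant, metric). That row layout and the function names `session_means` and `welch_t` are illustrative assumptions, not part of the report. Requests within one session are correlated, so each session is first collapsed to a single mean, and the two variants are then compared with Welch's t statistic on those per-session values.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, variance


def session_means(rows):
    """Collapse per-request rows (session_id, variant, metric) to one mean per session.

    Returns {variant: [session_mean, ...]}. Assumes each session sees one variant.
    """
    by_session = defaultdict(list)
    variant_of = {}
    for session_id, variant, metric in rows:
        by_session[session_id].append(metric)
        variant_of[session_id] = variant
    per_variant = defaultdict(list)
    for session_id, values in by_session.items():
        per_variant[variant_of[session_id]].append(mean(values))
    return dict(per_variant)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Each sample needs at least two sessions with non-zero variance overall.
    """
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical data: six requests spread over four sessions.
rows = [
    ("s1", "A", 1.0), ("s1", "A", 3.0), ("s2", "A", 4.0),
    ("s3", "B", 5.0), ("s4", "B", 7.0), ("s4", "B", 9.0),
]
groups = session_means(rows)
t_stat, dof = welch_t(groups["A"], groups["B"])
```

The sample size here is the number of sessions, not the number of requests, so the variance estimate is not understated by treating correlated requests as independent.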
Journey Context:
Traditional A/B tests assume i.i.d. data and no interference between units, the stable unit treatment value assumption (SUTVA). AI systems often violate SUTVA because they are stateful or share context across users. If variant B uses a more verbose AI, it may consume more shared compute and slow down variant A (resource interference). If the AI generates content that users in the control group can also see, the treatment leaks into the control group (information interference). Engineers often run a standard t-test on AI metrics anyway and get wildly fluctuating p-values, which leads to false positives or abandoned experiments. The fix is to treat the AI model as a shared resource and design the experiment to account for interference, often by switching to cluster-randomized or interleaving experiments.
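A cluster-randomized assignment can be sketched as follows, assuming users can be grouped into clusters (for example a team or workspace) whose members share AI state or generated content. The function name `assign_variant`, the experiment key, and the cluster ids are hypothetical. Hashing the cluster id instead of the user id puts every user in a cluster into the same variant, so interference stays inside one arm of the experiment.

```python
import hashlib


def assign_variant(cluster_id, experiment, variants=("control", "treatment")):
    """Deterministically map a whole cluster to one variant.

    The experiment name is mixed into the hash so that different experiments
    get independent assignments for the same cluster.
    """
    key = f"{experiment}:{cluster_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


# Hypothetical mapping of users to the cluster they share state with.
user_cluster = {"u1": "team-a", "u2": "team-a", "u3": "team-b"}
user_variant = {
    user: assign_variant(cluster, "verbose-ai-v2")
    for user, cluster in user_cluster.items()
}
```

With this design the unit of analysis is the cluster, so metrics should be aggregated per cluster before any significance test is run.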
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:09:07.245192+00:00— report_created — created