Report #78751
[synthesis] Why A/B testing breaks for AI features
Isolate state (vector DBs, caches) per experiment arm and use variance reduction techniques (e.g., CUPED) on pre-experiment data to cut through AI's inherent non-determinism.
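The isolation half of this recommendation can be sketched as follows. This is a minimal illustration, not a prescribed implementation: `assign_arm`, `namespaced_key`, and `ArmScopedCache` are hypothetical names, and the in-memory dict only stands in for whatever cache or vector store is actually in use. The idea is that every key (or collection name) carries the experiment and arm, so writes from one arm can never be read by the other.

```python
import hashlib


def assign_arm(user_id, experiment, arms=("control", "treatment")):
    """Deterministically map a user to an arm by hashing (experiment, user_id)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return arms[int(digest, 16) % len(arms)]


def namespaced_key(experiment, arm, raw_key):
    """Prefix a cache key or collection name with the experiment and arm."""
    return f"{experiment}/{arm}/{raw_key}"


class ArmScopedCache:
    """In-memory stand-in for a shared cache, partitioned per arm (hypothetical)."""

    def __init__(self, experiment):
        self.experiment = experiment
        self._store = {}

    def get(self, arm, key):
        # Reads only see entries written under the same arm's namespace.
        return self._store.get(namespaced_key(self.experiment, arm, key))

    def put(self, arm, key, value):
        self._store[namespaced_key(self.experiment, arm, key)] = value


cache = ArmScopedCache("rerank-v2")
arm = assign_arm("user-123", "rerank-v2")
cache.put(arm, "query:refund policy", "cached answer")
```

The same prefixing approach would apply to vector store collection or namespace names, assuming the store supports separate collections per arm.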
Journey Context:
Standard A/B testing assumes independent, identically distributed samples and a deterministic code path in each arm. AI features can violate both assumptions. When arms share a vector database or cache, treatment-group queries can change the index contents or cache state that control-group requests read, which contaminates the control arm. In addition, LLM outputs vary from call to call, so sample sizes that suffice for deterministic features often yield inconclusive results. Engineers then conclude the feature has no effect when the effect is actually masked by noise or diluted by shared state. Isolating stateful infrastructure per arm and applying CUPED with pre-experiment user behavior as the covariate addresses both problems: isolation removes the contamination, and CUPED removes the share of metric variance that pre-experiment behavior already predicts.
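For readers unfamiliar with CUPED, here is a minimal sketch of the adjustment using only the Python standard library. It assumes one pre-experiment covariate per user (for example, the same metric measured before the experiment started) and estimates theta = cov(X, Y) / var(X) on data pooled across both arms. The function name `cuped_adjust` and the synthetic data in `demo` are illustrative assumptions, not part of the original report.

```python
import random
from statistics import fmean, variance


def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted metric values.

    metric: in-experiment outcome per user, pooled across both arms.
    covariate: the same users' pre-experiment value, in the same order.
    """
    if len(metric) != len(covariate):
        raise ValueError("metric and covariate must have equal length")
    mean_x = fmean(covariate)
    mean_y = fmean(metric)
    var_x = variance(covariate, mean_x)
    if var_x == 0:
        return list(metric)  # a constant covariate carries no information
    cov_xy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(covariate, metric)
    ) / (len(metric) - 1)
    theta = cov_xy / var_x
    # Subtract the part of each outcome explained by pre-experiment behavior.
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]


def demo(seed=0, n=2000):
    """Synthetic example: outcome tracks pre-period behavior plus a small lift."""
    rng = random.Random(seed)
    pre = [rng.gauss(10, 3) for _ in range(n)]
    arm = [i % 2 for i in range(n)]  # 0 = control, 1 = treatment
    post = [x + 0.2 * a + rng.gauss(0, 1) for x, a in zip(pre, arm)]
    adjusted = cuped_adjust(post, pre)
    return variance(post), variance(adjusted)


raw_var, adjusted_var = demo()
print(f"raw variance: {raw_var:.2f}, CUPED-adjusted variance: {adjusted_var:.2f}")
```

The covariate must be measured before assignment so that the treatment cannot influence it; under that condition the adjustment leaves the expected difference between arms unchanged and only shrinks the variance. The arm comparison is then run on the adjusted values in place of the raw metric.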
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:46:56.825071+00:00— report_created — created