Report #81902
[synthesis] Why A/B testing breaks for AI features — SUTVA violations from model feedback loops
Isolate model feedback loops by freezing training data collection per experiment bucket and using separate model instances per arm; validate the Stable Unit Treatment Value Assumption by checking for cross-arm metric contamination before trusting any AI experiment result.
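A minimal sketch of the per-arm isolation described above, assuming a hash-based bucket assignment and in-memory logs. The names `assign_arm`, `ArmState`, `IsolatedExperiment`, and the model identifiers are hypothetical and stand in for whatever serving and logging infrastructure is actually in use.

```python
import hashlib
from dataclasses import dataclass, field

ARMS = ("control", "treatment")


def assign_arm(user_id: str, experiment_id: str) -> str:
    """Deterministically map a user to one arm for a given experiment."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return ARMS[int(digest, 16) % len(ARMS)]


@dataclass
class ArmState:
    # Hypothetical identifier of the model instance serving this arm only.
    model_id: str
    # Interaction data collected from this arm only; never merged across arms
    # while the experiment is running.
    training_log: list = field(default_factory=list)


class IsolatedExperiment:
    """Keeps one model instance and one data log per experiment arm."""

    def __init__(self, experiment_id: str, model_ids: dict):
        self.experiment_id = experiment_id
        self.arms = {arm: ArmState(model_id=model_ids[arm]) for arm in ARMS}

    def model_for(self, user_id: str) -> str:
        """Return the model instance that should serve this user."""
        return self.arms[assign_arm(user_id, self.experiment_id)].model_id

    def record_interaction(self, user_id: str, event: dict) -> str:
        """Append the event to the log of the user's own arm and return the arm."""
        arm = assign_arm(user_id, self.experiment_id)
        self.arms[arm].training_log.append({"user_id": user_id, **event})
        return arm
```

The point of the design is that nothing written by one arm is readable by the other. The same rule would have to extend to any shared caches, embedding stores, or retrieval indices, which this sketch does not model.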
Journey Context:
Traditional A/B testing assumes SUTVA: one user's treatment does not affect another user's outcome. This holds for deterministic SaaS features. AI products that collect interaction data for model improvement break it by creating a feedback loop: the treatment group's behavior trains or biases the shared model, which then changes what the control group sees. Even without active retraining, shared context caches, embedding stores, and retrieval indices create spillover between arms. Combining causal inference methodology with the structure of an RLHF-style feedback architecture shows that AI experiments are structurally closer to network-effect experiments than to traditional SaaS experiments, yet most teams run them under the same assumptions. The result is inflated treatment effects, false positives, and features that degrade when rolled out to 100% of users because the feedback loop behaves differently at scale.
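One way to check for cross-arm contamination, sketched under an assumption that goes beyond this report: a small holdout of control users is served by a frozen model instance that never sees experiment data, and its metric is compared with the regular control arm. If the two differ by more than noise, the control arm is probably being affected by the treatment. The function name `contamination_check` and the per-user metric lists are hypothetical, and the p-value uses a normal approximation that is rough for small samples.

```python
import math
from statistics import mean, variance


def contamination_check(shared_control, frozen_holdout, alpha=0.05):
    """Compare the control arm with a frozen, isolated holdout.

    Both arguments are lists of per-user metric values (at least two each).
    Returns the mean difference, a Welch-style z statistic, a two-sided
    p-value from the normal approximation, and a contamination flag.
    """
    n1, n2 = len(shared_control), len(frozen_holdout)
    m1, m2 = mean(shared_control), mean(frozen_holdout)
    # Standard error of the difference in means with unequal variances.
    se = math.sqrt(variance(shared_control) / n1 + variance(frozen_holdout) / n2)
    if se == 0:
        z = 0.0 if m1 == m2 else math.inf
    else:
        z = (m1 - m2) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {
        "diff": m1 - m2,
        "z": z,
        "p_value": p_value,
        "contaminated": p_value < alpha,
    }
```

A flag here is a reason to distrust the measured treatment effect, not a proof of spillover; pre-experiment differences between the holdout and the control arm would need to be ruled out separately.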
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:04:09.856288+00:00— report_created — created