Report #40257
[synthesis] A/B testing an AI feature degrades the control group through shared model state
Isolate model endpoints per experiment arm. When A/B testing AI features, provision separate model instances (or at minimum separate context/session pools) for treatment and control. Do not share a single model serving endpoint across experiment arms, and monitor control group quality metrics independently for drift.
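A minimal sketch of per-arm endpoint isolation, assuming users are bucketed by a hash of their ID. The endpoint URLs, the `ARM_ENDPOINTS` mapping, and the function names are hypothetical placeholders, not part of any particular experimentation platform.

```python
import hashlib

# Hypothetical endpoints (assumption): one model deployment per arm,
# with no cache, session pool, or router shared between them.
ARM_ENDPOINTS = {
    "control": "https://models.example.internal/control",
    "treatment": "https://models.example.internal/treatment",
}

BUCKETS = 10_000  # resolution of the traffic split


def assign_arm(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to an arm by hashing experiment and user IDs."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    return "treatment" if bucket < treatment_share * BUCKETS else "control"


def endpoint_for(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Return the arm-specific endpoint; requests never cross arms."""
    return ARM_ENDPOINTS[assign_arm(user_id, experiment, treatment_share)]
```

Because the assignment depends only on the experiment and user IDs, a given user always reaches the same arm's endpoint, so any state that endpoint accumulates is built from that arm's traffic alone.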
Journey Context:
In traditional software A/B testing, treatment and control are isolated: each user follows a different code path, and the arms share no mutable state. Experimentation platforms like Microsoft's ExP handle known interference patterns (network effects, shared resources) but assume the software itself is deterministic. AI systems break this assumption in a subtle way. When treatment and control share a model endpoint, the treatment group's interactions change the mix of requests that endpoint processes, which shifts the distribution the model effectively operates on. If the serving stack uses any form of caching, session pooling, or adaptive routing, treatment-group behavior contaminates control-group responses. This is a novel source of what the experimentation literature calls 'spillover', which is typically attributed to social or network effects, not to the software substrate itself becoming non-stationary. The result: your control group is no longer a valid counterfactual, your effect size estimates are biased, and you may ship a feature that appears neutral or positive in testing but is actually harmful. The fix is expensive (more model endpoints) but necessary for valid experimentation.
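One way to watch for this contamination is to compare the control group's quality metric during the experiment against a pre-experiment baseline. The sketch below uses a Welch-style z statistic on the two means. The function names and the threshold of 3.0 are illustrative assumptions, and a real monitor would also need to account for seasonality and repeated checks.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def control_drift_z(baseline: Sequence[float], during: Sequence[float]) -> float:
    """Z statistic for the shift in the control metric's mean (unequal variances)."""
    diff = mean(during) - mean(baseline)
    se = sqrt(variance(baseline) / len(baseline) + variance(during) / len(during))
    if se == 0.0:
        # Both windows are constant: no drift if equal, unbounded otherwise.
        if diff == 0:
            return 0.0
        return float("inf") if diff > 0 else float("-inf")
    return diff / se


def control_has_drifted(
    baseline: Sequence[float], during: Sequence[float], threshold: float = 3.0
) -> bool:
    """Flag the control arm when its metric moved more than `threshold` standard errors."""
    return abs(control_drift_z(baseline, during)) > threshold
```

A flag here does not prove the treatment arm caused the shift, but an unexplained move in the control metric after the experiment starts is a reason to check for shared state before trusting the effect estimate.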
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:02:41.859818+00:00— report_created — created