Report #65507
[synthesis] Why A/B testing AI features produces contaminated or contradictory results
Route experiment variants to isolated model instances with separate weights and context. Never share a model endpoint between control and treatment if the model ingests user interaction data. Use feature flags at the model level, not just the application level. Validate experiment isolation by comparing control-group metrics before and after the treatment is deployed: drift in the control group that coincides with the rollout suggests leakage.
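A minimal sketch of the routing and validation steps above, in Python. Everything named here is hypothetical and not taken from the report: the endpoint URLs, the function names, and the choice of a Welch-style z-score as the drift measure. It assumes each variant has its own model deployment and that a per-day control metric is available before and after the treatment rollout.

```python
import hashlib
from statistics import mean, stdev

# Hypothetical endpoints: one model instance per variant, never shared.
VARIANT_ENDPOINTS = {
    "control": "https://models.internal.example/control",
    "treatment": "https://models.internal.example/treatment",
}


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user so they always see the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def endpoint_for(user_id: str, experiment: str) -> str:
    """Resolve the model endpoint from the variant (the model-level flag)."""
    return VARIANT_ENDPOINTS[assign_variant(user_id, experiment)]


def endpoints_are_isolated(endpoints: dict) -> bool:
    """True only if no two variants point at the same endpoint."""
    return len(set(endpoints.values())) == len(endpoints)


def control_drift_z(before: list, after: list) -> float:
    """Standardized shift of a control metric after treatment deployment.

    Each list needs at least two samples with nonzero variance.
    """
    se = (stdev(before) ** 2 / len(before) + stdev(after) ** 2 / len(after)) ** 0.5
    return (mean(after) - mean(before)) / se
```

A large absolute z-score for the control group is a prompt to investigate, not proof of leakage: seasonality or an unrelated release can also move control metrics, so the threshold and the comparison window are judgment calls.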
Journey Context:
The fundamental assumption of A/B testing is independence: what happens in treatment doesn't affect control. In deterministic software, showing user A a blue button doesn't change what user B sees. AI products often violate this assumption by default. If the model does online learning, treatment-group interactions alter the model that serves control users. Even without online learning, shared context windows, caching layers, and RAG indices create leakage paths. Combining Kohavi's experiment trustworthiness principles with ML serving architecture shows that a shared model endpoint is a structural confound: the experiment measures the treatment plus contamination, not the treatment alone. Teams must architect model-level isolation, not just application-level randomization.
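One of the leakage paths named above, the shared cache, can be illustrated with a short Python sketch. The class and function names are hypothetical, and the in-memory dict stands in for whatever cache or retrieval index a real system uses. The idea shown is only that every key and index name carries the variant, so a response produced for treatment can never be served to control.

```python
class VariantScopedCache:
    """Response cache whose keys always include the experiment variant."""

    def __init__(self):
        self._store = {}

    def put(self, variant: str, prompt: str, response: str) -> None:
        # The variant is part of the key, so entries never cross groups.
        self._store[(variant, prompt)] = response

    def get(self, variant: str, prompt: str):
        # Returns None on a miss, including when only the other variant has an entry.
        return self._store.get((variant, prompt))


def index_name(base: str, variant: str) -> str:
    """Hypothetical naming scheme for a per-variant retrieval (RAG) index."""
    return f"{base}--{variant}"
```

For example, after `cache.put("treatment", "q", "new answer")`, a lookup with `cache.get("control", "q")` misses, which is the isolation property the experiment depends on. The cost is a lower hit rate and duplicated index storage for the duration of the experiment.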
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:26:13.512806+00:00— report_created — created