Report #31600
[synthesis] A/B test says new AI model is better but metrics degrade after full rollout — hidden interference and non-stationarity
When A/B testing model changes, account for three violations of standard experiment assumptions: (1) shared-model correlation: users on the same model arm receive outputs with the same failure modes, so their outcomes are correlated and the nominal sample size overstates the effective sample size; (2) temporal non-stationarity: model performance drifts as the input distribution shifts, so early results do not predict later performance; (3) spillover effects: if AI outputs are visible to other users, treatment effects leak across arms. Use time-stratified analysis, persist holdout groups across multiple experiment cycles, and validate that treatment effects are stable over time before full rollout.
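As a rough illustration of time-stratified analysis, the sketch below computes the relative lift separately for each time stratum and passes only when every stratum agrees. The row layout `(stratum, arm, metric)`, the function names `lift_by_stratum` and `is_stable`, and the sample numbers are all hypothetical, chosen for illustration; a real analysis would also attach confidence intervals to each stratum.

```python
from collections import defaultdict
from statistics import mean


def lift_by_stratum(rows):
    """Relative lift of treatment over control, computed per time stratum.

    rows: iterable of (stratum, arm, metric), with arm in
    {"control", "treatment"}. This layout is an assumption of the sketch.
    """
    buckets = defaultdict(lambda: {"control": [], "treatment": []})
    for stratum, arm, metric in rows:
        buckets[stratum][arm].append(metric)

    lifts = {}
    for stratum, arms in buckets.items():
        # Skip strata where one arm has no observations.
        if not arms["control"] or not arms["treatment"]:
            continue
        control_mean = mean(arms["control"])
        if control_mean == 0:
            continue  # relative lift is undefined against a zero baseline
        lifts[stratum] = (mean(arms["treatment"]) - control_mean) / control_mean
    return lifts


def is_stable(lifts, min_lift=0.0):
    """True only if every stratum shows a lift above min_lift."""
    return bool(lifts) and all(lift > min_lift for lift in lifts.values())


# Hypothetical data: the new model wins on Tuesday and loses on Saturday.
rows = [
    ("Tue", "control", 10.0), ("Tue", "treatment", 11.0),
    ("Sat", "control", 10.0), ("Sat", "treatment", 9.0),
]
lifts = lift_by_stratum(rows)
print(lifts)
print(is_stable(lifts))
```

A pooled comparison of these four rows would show no difference at all, while the per-stratum view exposes a gain on one day and a loss on the other, which is the pattern to look for before a full rollout.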
Journey Context:
Standard A/B testing assumes independent, identically distributed observations. With AI, this breaks in three ways simultaneously. First, all users on the same model version share the same failure modes: if the model hallucinates on a specific query pattern, every user hitting that pattern gets the same bad experience. These correlated outcomes make your effective sample size much smaller than your nominal sample size, so confidence intervals computed under independence are too narrow. Second, AI model quality is not stationary: it depends on the input distribution, which changes by time of day, day of week, user cohort, and external events. An A/B test run on Tuesday's traffic may not predict Saturday's performance. Third, AI outputs often create interference between users: one user's AI-generated content or recommendation is consumed by another user, possibly in the other arm, and this spillover biases your estimate of the treatment effect. The classic failure mode: your experiment shows +5% engagement because the new model happens to work well on the traffic distribution during the test window. After rollout, the distribution shifts and the gain disappears or reverses.
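To make the gap between effective and nominal sample size concrete, one common approximation (not named in this report, so treat it as an assumption) is the Kish design effect for clustered data: n_eff = n / (1 + (m - 1) * rho), where m is the average cluster size and rho is the intra-cluster correlation. In the sketch below a "cluster" is taken to be the set of users hitting the same query pattern, and the input values are hypothetical; rho would have to be estimated from your own data.

```python
def effective_sample_size(n, avg_cluster_size, icc):
    """Approximate effective sample size under the Kish design effect.

    n: nominal number of observations in one arm
    avg_cluster_size: average number of observations sharing a cluster
        (here, assumed to be users hitting the same query pattern)
    icc: intra-cluster correlation, between 0 and 1
    """
    if n <= 0 or avg_cluster_size < 1 or not 0.0 <= icc <= 1.0:
        raise ValueError("expected n > 0, avg_cluster_size >= 1, 0 <= icc <= 1")
    design_effect = 1.0 + (avg_cluster_size - 1.0) * icc
    return n / design_effect


# Hypothetical inputs: 10,000 users, 50 users per query pattern, rho = 0.1.
n_eff = effective_sample_size(10_000, 50, 0.1)
print(round(n_eff))
```

With these illustrative numbers the design effect is 5.9, so 10,000 nominal observations carry roughly the information of 1,700 independent ones. A sample size calculated under the independence assumption would then be several times too small.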
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T07:25:33.179585+00:00— report_created — created