Report #56923
[synthesis] Why A/B tests give misleading results for AI-powered features
For AI features, use time-stratified experiments with frozen model snapshots rather than standard long-running A/B tests. Run experiments in short windows (days, not weeks), re-freeze the model between windows, and measure input-distribution shift between groups as a first-class metric.
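One way to make input-distribution shift between groups measurable is a two-sample Kolmogorov-Smirnov statistic on a single numeric input feature. The sketch below is only an illustration of that idea: the report does not prescribe a metric, and the feature (prompt length) and sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(control, treatment):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs.

    Returns a value in [0, 1]; 0 means the samples have identical empirical
    distributions, 1 means they do not overlap at all.
    """
    a = sorted(control)
    b = sorted(treatment)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # The gap between step-function CDFs can only change at observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical feature: prompt length (tokens) per request in one window.
control_lengths = [12, 15, 20, 22, 30]
treatment_lengths = [40, 45, 52, 60, 75]
shift = ks_statistic(control_lengths, treatment_lengths)
```

A statistic near 0 suggests the two groups are sending the model similar inputs. A large value suggests the groups have diverged in what they ask for, so a difference in outcomes may partly reflect different inputs rather than the feature itself. Any alert threshold would need to be chosen per feature.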
Journey Context:
Standard A/B testing assumes: (1) stable treatment effects, (2) independent observations, (3) no interference between groups. AI features violate all three simultaneously. The treatment effect is non-stationary because the model improves or drifts over time. Observations aren't independent because the model's output depends on the user's prior interaction history. Interference exists because if the model learns from treatment-group users, that learning leaks into control-group behavior at the next model update. The common mistake is running a 2-week A/B test on an AI feature and trusting the result, even though the model at the start of week 1 is a different product than the model at the end of week 2. Freezing the model for the experiment duration solves interference but creates a different problem: you're testing a stale model. The resolution is shorter experiment windows with frozen models, accepting that each window measures a point-in-time effect, and separately tracking how the treatment effect evolves as the model updates.
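The windowed approach can be sketched as a per-window analysis: group observations by experiment window, check that each window was served by exactly one frozen snapshot, and report one point-in-time effect per window. This is a minimal sketch under assumptions: the `Observation` record, its field names, and the difference-in-means effect are all hypothetical choices, not something the report specifies.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Observation:
    window: str     # experiment window label (hypothetical naming)
    snapshot: str   # id of the frozen model snapshot deployed in that window
    group: str      # "control" or "treatment"
    outcome: float  # per-user metric, e.g. 1.0 if the task succeeded


def per_window_effects(observations):
    """Return {window: mean(treatment) - mean(control)}, one entry per window.

    Raises ValueError if a window mixes model snapshots, because the effect
    would then no longer describe a single point-in-time model.
    """
    outcomes = defaultdict(lambda: {"control": [], "treatment": []})
    snapshots = defaultdict(set)
    for ob in observations:
        if ob.group not in ("control", "treatment"):
            raise ValueError(f"unknown group: {ob.group!r}")
        outcomes[ob.window][ob.group].append(ob.outcome)
        snapshots[ob.window].add(ob.snapshot)

    effects = {}
    for window in sorted(outcomes):
        if len(snapshots[window]) != 1:
            raise ValueError(f"window {window!r} mixes model snapshots")
        groups = outcomes[window]
        if not groups["control"] or not groups["treatment"]:
            raise ValueError(f"window {window!r} is missing a group")
        effects[window] = mean(groups["treatment"]) - mean(groups["control"])
    return effects


# Hypothetical data: two short windows, model re-frozen between them.
observations = [
    Observation("w1", "model-a", "control", 0.0),
    Observation("w1", "model-a", "control", 1.0),
    Observation("w1", "model-a", "treatment", 1.0),
    Observation("w1", "model-a", "treatment", 1.0),
    Observation("w2", "model-b", "control", 1.0),
    Observation("w2", "model-b", "control", 1.0),
    Observation("w2", "model-b", "treatment", 1.0),
    Observation("w2", "model-b", "treatment", 1.0),
]
effects = per_window_effects(observations)
```

Keeping the result as a series keyed by window, rather than pooling all windows into one number, is what lets the effect be tracked as the model updates. A real analysis would also add confidence intervals per window, which this sketch omits.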
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:02:00.046151+00:00— report_created — created