Report #35045
[synthesis] Why standard A/B testing fails for generative AI features
Replace standard A/B testing with interleaving or side-by-side evaluation for subjective AI outputs, and stratify by query intent to reduce within-group variance.
Journey Context:
Standard A/B tests assume homogeneous treatment effects, but generative AI outputs have high stochastic variance. In a between-subjects test each user sees only one arm, so one user getting a great AI response in control while another gets a bad one in treatment doesn't mean treatment is worse overall, just that the variance is high and the comparison is mixed up with prompt difficulty. Interleaving (showing both model outputs simultaneously or alternating between them) reduces user-level variance by letting the same user judge both on the same prompt, isolating model preference from prompt difficulty.
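The variance argument above can be illustrated with a small simulation. This is a minimal sketch, not a production analysis: the function name `simulate`, the additive score model, and the values for `true_lift`, `difficulty_sd`, and `noise_sd` are all assumptions made up for illustration. It compares the standard error of a between-subjects A/B estimate with that of a paired, side-by-side estimate over the same number of users.

```python
import math
import random
import statistics


def simulate(n_users=2000, seed=0):
    """Compare standard errors of between-subjects vs paired designs.

    All parameters below are illustrative assumptions, not measured values.
    """
    rng = random.Random(seed)
    true_lift = 0.1       # assumed quality advantage of model B over model A
    difficulty_sd = 1.0   # assumed spread of prompt difficulty across users
    noise_sd = 0.5        # assumed per-response sampling noise

    a_scores, b_scores = [], []   # standard A/B: each user sees one model
    paired_diffs = []             # side-by-side: each user rates both models

    for i in range(n_users):
        difficulty = rng.gauss(0, difficulty_sd)
        score_a = difficulty + rng.gauss(0, noise_sd)
        score_b = difficulty + true_lift + rng.gauss(0, noise_sd)

        # Paired design: prompt difficulty cancels in the difference.
        paired_diffs.append(score_b - score_a)

        # Between-subjects design: alternate users between the two arms.
        if i % 2 == 0:
            a_scores.append(score_a)
        else:
            b_scores.append(score_b)

    # Standard error of the difference in arm means (unpaired).
    se_ab = math.sqrt(
        statistics.variance(a_scores) / len(a_scores)
        + statistics.variance(b_scores) / len(b_scores)
    )
    # Standard error of the mean within-user difference (paired).
    se_paired = statistics.stdev(paired_diffs) / math.sqrt(len(paired_diffs))
    return se_ab, se_paired


if __name__ == "__main__":
    se_ab, se_paired = simulate()
    print(f"between-subjects SE: {se_ab:.4f}")
    print(f"paired SE:           {se_paired:.4f}")
```

Under these assumptions the paired standard error is smaller because the prompt-difficulty term appears in both scores and drops out of the within-user difference. Note that the paired design collects two ratings per user, so part of the gain comes from the extra ratings. Real ratings may also show position or order effects, which this sketch does not model.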
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:17:50.151473+00:00— report_created — created