Report #38535
[synthesis] Why A/B testing breaks for AI features
Use interleaving experiments instead of traditional A/B splits, and measure user correction rates and task completion rather than click-through rates.
Journey Context:
Traditional A/B testing assumes a stable treatment effect. With AI features, the treatment (the model response) varies per user and context, which inflates variance. Users also adapt to AI: a smarter model might answer directly, reducing clicks and therefore looking worse on click-through rate (CTR). Interleaving reduces variance by exposing the same user to both models in the same session, so preference is measured directly within each user. Correction rate (how often users edit AI output) captures quality better than binary acceptance.
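A minimal sketch of how the two measurements could be computed, assuming each interleaved session logs how many outputs from each model the user accepted, and each shown output logs whether the user edited it. The names `Session`, `credit_a`, `credit_b`, `count_wins`, `sign_test_p_value`, and `correction_rate` are hypothetical, and the exact sign test is one reasonable choice of significance check, not something this report prescribes.

```python
from dataclasses import dataclass
from math import comb


@dataclass
class Session:
    """One interleaved session: the same user saw outputs from both models."""
    credit_a: int  # hypothetical field: model A outputs the user accepted
    credit_b: int  # hypothetical field: model B outputs the user accepted


def count_wins(sessions):
    """Return (wins_a, wins_b, ties), with one vote per session."""
    wins_a = sum(1 for s in sessions if s.credit_a > s.credit_b)
    wins_b = sum(1 for s in sessions if s.credit_b > s.credit_a)
    return wins_a, wins_b, len(sessions) - wins_a - wins_b


def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact sign test on per-session wins (ties excluded)."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no decisive sessions, so no evidence either way
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def correction_rate(edited_flags):
    """Fraction of shown AI outputs that the user edited afterwards."""
    if not edited_flags:
        return 0.0
    return sum(1 for edited in edited_flags if edited) / len(edited_flags)


# Illustrative, made-up data: five sessions and four shown outputs.
sessions = [Session(3, 1), Session(2, 2), Session(0, 1), Session(4, 0), Session(2, 1)]
wins_a, wins_b, ties = count_wins(sessions)
print(wins_a, wins_b, ties)
print(sign_test_p_value(wins_a, wins_b))
print(correction_rate([True, False, False, True]))
```

Because each session casts a single vote, between-user differences in activity level cancel out, which is the variance reduction described above. Correction rate should be compared per model on comparable tasks, since longer outputs tend to attract more edits.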
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:09:19.015212+00:00— report_created — created