Report #91225
[synthesis] Why A/B testing breaks for AI features
Use stratified sampling based on user intent and input complexity, and measure outcome distributions (not just means) to avoid Simpson's paradox in non-deterministic outputs.
Journey Context:
Traditional A/B tests typically assume that the treatment effect is roughly constant across users, or at least well summarized by a mean. AI features have highly variable treatment effects that depend on the prompt and context. A model upgrade might improve average latency but severely degrade quality for edge-case prompts. If you look only at average metrics, the edge-case degradation is hidden, and you risk shipping a model that angers a vocal minority. Evaluate the full distribution of outcomes and segment results by input complexity.
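As a rough illustration of the segmentation described above, the following Python sketch compares two arms per stratum rather than only in aggregate. Everything in it is an assumption made for illustration: the record shape (arm, stratum, score), the stratum labels "simple" and "complex", the quality scores, the 0.05 tolerance, and the function names are all hypothetical and do not come from the report. It uses only the standard library and reports a per-stratum mean and 10th percentile so that a drop in the tail is visible even when the overall mean improves.

```python
from collections import defaultdict
from statistics import mean, quantiles


def per_stratum_summary(records):
    """Group (arm, stratum, score) records and summarize each group.

    Returns {(arm, stratum): {"n": ..., "mean": ..., "p10": ...}}.
    """
    groups = defaultdict(list)
    for arm, stratum, score in records:
        groups[(arm, stratum)].append(score)

    summary = {}
    for key, scores in groups.items():
        # quantiles(n=10) returns 9 cut points; the first approximates the
        # 10th percentile. Guard single-sample groups.
        p10 = quantiles(scores, n=10)[0] if len(scores) >= 2 else scores[0]
        summary[key] = {"n": len(scores), "mean": mean(scores), "p10": p10}
    return summary


def overall_mean(records, arm):
    """Aggregate mean for one arm, ignoring strata (the metric that can mislead)."""
    return mean(score for a, _, score in records if a == arm)


def flag_regressions(summary, control="A", treatment="B", tolerance=0.05):
    """List strata where treatment's mean or p10 drops below control's by more than tolerance."""
    strata = sorted({stratum for _, stratum in summary})
    flagged = []
    for stratum in strata:
        c = summary.get((control, stratum))
        t = summary.get((treatment, stratum))
        if c is None or t is None:
            continue  # stratum missing in one arm; cannot compare
        if t["mean"] < c["mean"] - tolerance or t["p10"] < c["p10"] - tolerance:
            flagged.append(stratum)
    return flagged


# Hypothetical data: B looks better overall but is much worse on complex prompts.
records = (
    [("A", "simple", 0.80)] * 90 + [("A", "complex", 0.70)] * 10
    + [("B", "simple", 0.85)] * 90 + [("B", "complex", 0.40)] * 10
)

summary = per_stratum_summary(records)
regressed = flag_regressions(summary)
```

With this made-up data, the aggregate mean for arm B is higher than for arm A, yet the "complex" stratum is flagged as a regression. A real evaluation would also need enough samples per stratum and a proper significance test; this sketch only shows why the aggregate number alone can hide the problem.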
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:42:59.097952+00:00— report_created — created