Report #36246
[synthesis] Why A/B testing fails for AI features
Stratify the sample by user intent and query difficulty, use a stricter significance threshold to account for LLM output variance, and compare the delta of medians rather than the delta of means.
Journey Context:
Traditional A/B tests assume stable treatment effects. LLM output quality varies widely with prompt phrasing and input complexity. In a simple 50/50 split, the variance from the model's non-determinism often drowns out the actual feature signal, leading to false negatives. Control for input complexity to isolate the model's performance delta; otherwise you are mostly measuring noise.
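A minimal sketch of the stratified delta-of-medians comparison described above, using only the Python standard library. Everything here is an assumption for illustration, not part of the original report: the record layout `(stratum, arm, score)`, the function names `stratified_median_delta` and `bootstrap_ci`, the size-weighted averaging across strata, the percentile bootstrap, and the default `alpha=0.01` (standing in for a "stricter threshold"). A stratum would be something like an (intent, difficulty) bucket, and `score` whatever quality metric you already log.

```python
import random
import statistics
from collections import defaultdict


def stratified_median_delta(records):
    """Size-weighted average, across strata, of
    median(treatment scores) - median(control scores).

    records: iterable of (stratum, arm, score) tuples, where arm is
    "control" or "treatment" (hypothetical layout for this sketch).
    """
    by_stratum = defaultdict(lambda: {"control": [], "treatment": []})
    for stratum, arm, score in records:
        by_stratum[stratum][arm].append(score)

    weighted_sum = 0.0
    total = 0
    for arms in by_stratum.values():
        if not arms["control"] or not arms["treatment"]:
            continue  # a stratum with only one arm cannot be compared
        n = len(arms["control"]) + len(arms["treatment"])
        delta = (statistics.median(arms["treatment"])
                 - statistics.median(arms["control"]))
        weighted_sum += n * delta
        total += n
    if total == 0:
        raise ValueError("no stratum contains both arms")
    return weighted_sum / total


def bootstrap_ci(records, n_boot=2000, alpha=0.01, seed=0):
    """Percentile bootstrap interval for the stratified median delta.

    Resamples with replacement inside each (stratum, arm) cell so the
    stratification is preserved in every bootstrap replicate.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for rec in records:
        cells[(rec[0], rec[1])].append(rec)

    deltas = []
    for _ in range(n_boot):
        sample = []
        for recs in cells.values():
            sample.extend(rng.choices(recs, k=len(recs)))
        deltas.append(stratified_median_delta(sample))

    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

If the interval returned by `bootstrap_ci` excludes zero, the per-stratum median shift is unlikely to be explained by output variance alone at the chosen `alpha`. Weighting strata by size is one reasonable choice; weighting by expected production traffic mix would be another.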
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:19:12.328120+00:00— report_created — created