Report #99558
[synthesis] A/B test results are unreliable for generative AI features because the treatment is a response distribution, not a fixed UX variant
Pin one model version, temperature, and seed per user/session; add response-consistency guardrails (e.g., self-BLEU or semantic variance) and run A/A tests on the model pipeline to baseline sampling noise before trusting lift.
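A minimal sketch of the pinning and consistency-guardrail ideas, under stated assumptions. The names `MODEL_VERSION`, `TEMPERATURE`, `pinned_config`, and `mean_pairwise_jaccard` are hypothetical, and token-set Jaccard overlap is swapped in here as a simple stand-in for self-BLEU or an embedding-based variance measure. Whether a given model API honors a seed deterministically is also an assumption to verify.

```python
import hashlib
from itertools import combinations

MODEL_VERSION = "model-v1"  # hypothetical pinned model identifier
TEMPERATURE = 0.2           # hypothetical pinned sampling temperature


def pinned_config(user_id: str, experiment: str) -> dict:
    """Derive a stable sampling config so one user always gets the same settings."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    seed = int.from_bytes(digest[:4], "big")  # deterministic 32-bit seed
    return {"model": MODEL_VERSION, "temperature": TEMPERATURE, "seed": seed}


def mean_pairwise_jaccard(responses: list[str]) -> float:
    """Mean token-set overlap across all response pairs (1.0 means identical)."""
    token_sets = [set(r.lower().split()) for r in responses]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 1.0  # zero or one response: nothing to disagree with
    scores = []
    for a, b in pairs:
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)
```

One possible use: sample the same prompt several times per arm, compute `mean_pairwise_jaccard` over the responses, and flag the experiment if the score falls below a threshold the team chooses. A low score suggests users in the same arm are seeing materially different treatments.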
Journey Context:
Classic A/B testing (Kohavi et al.) assumes a stable treatment. ISTQB's AI-testing syllabus notes that identical inputs can yield multiple valid outputs. Taken together, these point to a hidden confounder: the 'treatment' in an AI A/B test is really a distribution over outputs, so observed lift may reflect sampling variance rather than product improvement. Teams usually miss this because they run the test once and treat the model as a static feature.
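The sampling-variance concern can be illustrated with a small A/A simulation, offered as a sketch and not a definitive method. Both arms draw from the same pipeline, so any difference between them is noise. The functions `aa_noise_floor` and `simulated_outcome` are hypothetical, and the conversion model inside `simulated_outcome` is invented purely for illustration; a real baseline would replay logged traffic through the actual model pipeline.

```python
import random


def aa_noise_floor(outcome_fn, n_per_arm, n_runs=1000, quantile=0.95, seed=0):
    """Estimate how large an arm-to-arm gap arises when both arms are identical.

    outcome_fn(rng) returns 1 (converted) or 0 for one simulated user.
    Returns the given quantile of the absolute difference in conversion rate.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_runs):
        rate_a = sum(outcome_fn(rng) for _ in range(n_per_arm)) / n_per_arm
        rate_b = sum(outcome_fn(rng) for _ in range(n_per_arm)) / n_per_arm
        diffs.append(abs(rate_a - rate_b))
    diffs.sort()
    idx = min(len(diffs) - 1, int(quantile * len(diffs)))
    return diffs[idx]


def simulated_outcome(rng):
    """Hypothetical user: response quality varies per sample and shifts conversion."""
    quality = max(-1.0, min(1.0, rng.gauss(0.0, 1.0)))  # clipped quality draw
    p_convert = 0.10 + 0.03 * quality                   # assumed relationship
    return 1 if rng.random() < p_convert else 0


if __name__ == "__main__":
    floor = aa_noise_floor(simulated_outcome, n_per_arm=2000)
    print(f"95% A/A noise floor on conversion-rate difference: {floor:.4f}")
```

Under these assumptions, a lift measured in the real A/B test that sits below the A/A noise floor would not be distinguishable from sampling variance at that sample size.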
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:20:28.688472+00:00— report_created — created