Report #98615
[synthesis] A/B tests for LLM features produce false positives because standard power analysis assumes deterministic, homogeneous, independent units
Size LLM experiments by measured per-task output variance (not aggregate variance), pin model version/temperature/seed, assign at session level not request level, and require semantic-evaluation guardrails alongside product metrics before shipping.
Journey Context:
Classical A/B testing assumes that a treatment produces the same output for identical inputs, that variance is constant across units, and that units are independent. LLM features violate all three assumptions. Outputs are nondeterministic: even at temperature=0, runs can diverge by ~24% because of nondeterminism in the inference system. Variance is heteroskedastic (5–10% on easy lookups, 40–60% on complex reasoning). Request-level assignment violates SUTVA, because conversation history carries over and turn N affects turn N+1. As a result, a standard two-week test on a prompt tweak can yield p=0.03 on click-through while semantic accuracy silently degrades. The right design treats each variant as a distribution over outputs, powers the experiment for the hardest (highest-variance) task segment, holds the inference stack constant, and never ships on engagement metrics alone.
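A minimal Python sketch of two of the design points above: sizing by the highest-variance task segment and assigning at session level. It assumes a continuous evaluation score per session and the textbook two-sample normal-approximation formula n = 2(z_{1-α/2} + z_{1-β})² σ² / δ². The segment names, standard deviations, minimum effect, and the experiment label are hypothetical placeholders, not figures from this report; substitute variances measured on your own traffic.

```python
import hashlib
import math
from statistics import NormalDist


def samples_per_arm(sd: float, min_effect: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-sided, two-sample z-test on a mean.

    sd is the measured standard deviation of the metric within one task
    segment; min_effect is the smallest difference worth detecting.
    Units are assumed independent, so count sessions, not requests.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / min_effect ** 2)


def assign_variant(session_id: str, experiment: str = "prompt-tweak-v1") -> str:
    """Deterministically bucket a whole session into one arm.

    Hashing the session ID (not the request ID) keeps every turn of a
    conversation in the same variant, so turn N cannot leak across arms.
    """
    digest = hashlib.sha256(f"{experiment}:{session_id}".encode("utf-8")).digest()
    return "treatment" if digest[0] % 2 else "control"


# Hypothetical per-segment standard deviations of the evaluation score.
SEGMENT_SD = {"easy_lookup": 0.10, "complex_reasoning": 0.50}
MIN_EFFECT = 0.05  # hypothetical minimum detectable difference

# Size each segment separately, then power the test for the worst case.
sizes = {seg: samples_per_arm(sd, MIN_EFFECT) for seg, sd in SEGMENT_SD.items()}
required_n = max(sizes.values())

if __name__ == "__main__":
    for seg, n in sizes.items():
        print(f"{seg}: {n} sessions per arm")
    print(f"power the experiment for: {required_n} sessions per arm")
    print("session-42 ->", assign_variant("session-42"))
```

Because required sample size scales with the square of the standard deviation, a segment with five times the spread needs roughly twenty-five times the sessions, which is why sizing from aggregate variance tends to leave the hard segment underpowered.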
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T05:16:34.271261+00:00— report_created — created