Report #99104
[synthesis] A/B test wins for LLM features regress in production because engagement proxies do not track correctness or long-term trust
Use an Overall Evaluation Criterion (OEC) that combines online task completion, human-rater agreement, and cost/latency guardrails; hold out a long-term quality cohort rather than shipping on short-term lift alone.
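One way to read the recommendation above is as a composite score plus hard vetoes. The following is a minimal sketch, not a prescribed implementation: the `ArmMetrics` fields, the equal weights, and the 10% guardrail ratios are all hypothetical placeholders, and significance testing is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmMetrics:
    """Per-arm summary. All field names are illustrative assumptions."""
    task_completion: float   # fraction of sessions where the user finished the task (0..1)
    rater_agreement: float   # fraction of sampled outputs human raters judged correct (0..1)
    cost_per_request: float  # any consistent currency unit
    p95_latency_ms: float


def oec(m: ArmMetrics, w_completion: float = 0.5, w_quality: float = 0.5) -> float:
    """Weighted composite of user outcome and output quality (weights are assumed)."""
    return w_completion * m.task_completion + w_quality * m.rater_agreement


def ship_decision(
    control: ArmMetrics,
    treatment: ArmMetrics,
    max_cost_ratio: float = 1.10,     # assumed guardrail: at most 10% more expensive
    max_latency_ratio: float = 1.10,  # assumed guardrail: at most 10% slower at p95
    min_lift: float = 0.0,
) -> str:
    """Guardrails act as vetoes; the OEC decides only when both guardrails pass."""
    if treatment.cost_per_request > control.cost_per_request * max_cost_ratio:
        return "hold: cost guardrail"
    if treatment.p95_latency_ms > control.p95_latency_ms * max_latency_ratio:
        return "hold: latency guardrail"
    if oec(treatment) - oec(control) <= min_lift:
        return "hold: no OEC lift"
    return "ship"


control = ArmMetrics(0.65, 0.82, 0.010, 900.0)
# Completion improves but rated correctness drops sharply: the composite blocks it.
agreeable_but_wrong = ArmMetrics(0.66, 0.70, 0.010, 900.0)
decision = ship_decision(control, agreeable_but_wrong)
```

Treating cost and latency as vetoes rather than weighted terms keeps a large engagement gain from buying its way past an efficiency regression. A "ship" result here would still leave the long-term quality cohort held out and monitored.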
Journey Context:
Standard A/B metrics such as CTR or session length reward outputs that feel helpful (longer, more agreeable answers) but can raise hallucination rates and churn. Kohavi's 'Seven Pitfalls' shows that surrogate metrics and novelty effects mislead; a non-deterministic system makes the problem worse, because the same prompt can produce both great and harmful answers. Teams often skip quality evaluation because it is slower to measure than engagement. The synthesis is to treat the A/B test as a triplet (user outcome + output quality + efficiency), with automated rollback if any leg degrades.
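The "rollback if any leg degrades" rule can be sketched as a per-leg comparison against control. This is an illustration under stated assumptions: the leg names, the sample values, and the tolerances are hypothetical, and every leg is assumed to be normalized so that higher is better.

```python
from typing import Dict, List

Legs = Dict[str, float]  # leg name -> score, oriented so higher is better (assumption)


def degraded_legs(control: Legs, treatment: Legs, tolerance: Legs) -> List[str]:
    """Return the legs where treatment falls below control by more than the tolerance."""
    return [
        leg
        for leg in control
        if treatment[leg] < control[leg] - tolerance[leg]
    ]


def should_rollback(control: Legs, treatment: Legs, tolerance: Legs) -> bool:
    """Roll back if any single leg degrades; gains on other legs do not compensate."""
    return bool(degraded_legs(control, treatment, tolerance))


# Hypothetical readings: the user-outcome proxy improves while rated quality drops.
control = {"user_outcome": 0.62, "output_quality": 0.81, "efficiency": 0.90}
treatment = {"user_outcome": 0.66, "output_quality": 0.74, "efficiency": 0.91}
tolerance = {"user_outcome": 0.02, "output_quality": 0.02, "efficiency": 0.02}

failed = degraded_legs(control, treatment, tolerance)
rollback = should_rollback(control, treatment, tolerance)
```

With these made-up numbers only the quality leg fails, which is the scenario the report describes: an engagement win that should not ship. A production check would compare confidence intervals rather than point estimates.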
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:19:05.641225+00:00— report_created — created