Report #99562
[synthesis] Offline benchmark improvements regularly fail to translate into production A/B wins for AI features
Run shadow-mode inference on production traffic before launch; compare per-cohort calibrated metrics, not just aggregate accuracy; require a positive production A/B test even when offline metrics look excellent.
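One way to sketch the per-cohort comparison on shadow-mode logs is shown below. It is a minimal illustration, not a prescribed implementation: the record layout `(cohort, predicted_prob, label)`, the function names, the cohort names, and the choice of expected calibration error (ECE) with 10 equal-width bins as the calibration metric are all assumptions made for this example.

```python
from collections import defaultdict


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: bin-size-weighted mean of |observed positive rate - mean predicted prob|."""
    bins = defaultdict(list)
    for p, y in zip(probs, labels):
        # Equal-width bins over [0, 1]; p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for items in bins.values():
        mean_conf = sum(p for p, _ in items) / len(items)
        pos_rate = sum(y for _, y in items) / len(items)
        ece += (len(items) / total) * abs(pos_rate - mean_conf)
    return ece


def per_cohort_report(records, n_bins=10):
    """records: iterable of (cohort, predicted_prob, label) tuples (hypothetical shadow-log schema)."""
    by_cohort = defaultdict(lambda: ([], []))
    for cohort, p, y in records:
        by_cohort[cohort][0].append(p)
        by_cohort[cohort][1].append(y)
    report = {}
    for cohort, (probs, labels) in by_cohort.items():
        correct = sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels))
        report[cohort] = {
            "n": len(probs),
            "accuracy": correct / len(probs),
            "ece": expected_calibration_error(probs, labels, n_bins),
        }
    return report


# Illustrative data only: the aggregate hides a cohort that is badly calibrated.
shadow_log = (
    [("new_users", 0.75, y) for y in (1, 1, 1, 0)]
    + [("returning", 0.75, y) for y in (1, 0, 0, 0)]
)
report = per_cohort_report(shadow_log)
for cohort, stats in report.items():
    print(cohort, stats)
```

Running the same report on the candidate and the current production model, cohort by cohort, surfaces regressions that a single aggregate accuracy number would average away.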
Journey Context:
Kohavi's work on Microsoft's Experimentation Platform (ExP) reports that only about one-third of tested ideas improve the metrics they were designed to improve. Sculley's technical-debt paper highlights hidden feedback loops and entanglement in ML systems. D'Amour's underspecification paper shows that many models can fit the same benchmark equally well yet behave differently under distribution shift. The synthesis: offline accuracy is a necessary filter but not sufficient evidence of product value, because production involves distribution shift, latency, user adaptation, and feedback loops that benchmarks exclude.
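The "necessary filter, not sufficient evidence" rule can be sketched as a two-stage launch gate. This is an illustration under stated assumptions, not a method taken from the cited papers: the function names, the conversion-style success metric, the one-sided two-proportion z-test, and the `alpha=0.05` threshold are all choices made for this example. A real experiment would also need a pre-registered sample size and guardrail metrics.

```python
import math


def two_proportion_z_test(conv_c, n_c, conv_t, n_t):
    """One-sided z-test that the treatment conversion rate exceeds control."""
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    if se == 0:
        raise ValueError("degenerate data: pooled rate is 0 or 1")
    z = (p_t - p_c) / se
    # Upper-tail probability of the standard normal: 1 - Phi(z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return p_t - p_c, z, p_value


def launch_decision(offline_passed, conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Offline metrics only gate entry; a positive production A/B result decides."""
    if not offline_passed:
        return "reject: failed offline filter"
    lift, _, p_value = two_proportion_z_test(conv_c, n_c, conv_t, n_t)
    if lift > 0 and p_value < alpha:
        return "ship"
    return "hold: no positive production A/B result"


# Illustrative numbers only.
print(launch_decision(True, 1000, 10000, 1000, 10000))   # excellent offline, flat online
print(launch_decision(True, 1000, 10000, 1100, 10000))   # offline pass plus online lift
print(launch_decision(False, 1000, 10000, 1100, 10000))  # never reaches the A/B stage
```

The design point is that `offline_passed` can only block a launch; it never substitutes for the production comparison.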
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:20:41.103233+00:00— report_created — created