Report #42989
[synthesis] Why passing offline evaluation doesn't mean your AI feature will work in production
Implement a three-stage deployment pipeline: shadow mode (model runs but outputs aren't shown) → canary with semantic evaluation (small percentage of traffic, outputs scored by an automated evaluator) → gradual rollout. Never go directly from offline eval to full deployment.
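The stage ordering above can be sketched as a small state machine. This is a minimal illustration, not a definitive implementation: the `Stage` names, the `VISIBLE_TRAFFIC` percentages, and the `next_stage` helper are all hypothetical, and the `OFFLINE_EVAL` and `FULL` endpoints are added only to bracket the three stages named in the report.

```python
from enum import Enum


class Stage(Enum):
    OFFLINE_EVAL = 0
    SHADOW = 1
    CANARY = 2
    GRADUAL_ROLLOUT = 3
    FULL = 4


# Hypothetical share of user-visible traffic served by the new model.
# The exact numbers are assumptions; the report only says "small percentage".
VISIBLE_TRAFFIC = {
    Stage.OFFLINE_EVAL: 0.0,
    Stage.SHADOW: 0.0,           # model runs on live inputs, outputs only logged
    Stage.CANARY: 0.05,          # small slice, outputs scored by an evaluator
    Stage.GRADUAL_ROLLOUT: 0.5,  # in practice ramped in several steps
    Stage.FULL: 1.0,
}


def next_stage(current: Stage, gate_passed: bool) -> Stage:
    """Advance exactly one stage when the gate passes; otherwise stay put.

    There is deliberately no way to jump from OFFLINE_EVAL to FULL.
    """
    if not gate_passed or current is Stage.FULL:
        return current
    return Stage(current.value + 1)
```

Because `next_stage` only ever moves one step, a passing offline eval can promote a model to shadow mode and no further; each later promotion needs its own gate result.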
Journey Context:
In conventional software, passing unit and integration tests is strong evidence that a feature works. In AI, offline evaluation metrics (accuracy, F1, BLEU) are often only weakly correlated with production performance, for four reasons that operate at once and that no single framework addresses: (1) Evaluation datasets are static and curated; production data is dynamic and messy. (2) Metrics measure average performance; users experience tail performance (the worst 5% of interactions). (3) Goodhart's law applies: the harder a metric is optimized during training, the worse it becomes as a proxy for real performance. (4) Deployment itself changes user behavior, which changes the input distribution, which changes model performance.
The synthesis is that offline eval is necessary but radically insufficient: it is the AI equivalent of compiling without running. The fix is a deployment pipeline that treats production as the real evaluation: shadow mode catches distribution shift, canary deployment with semantic evaluation catches tail-risk failures, and gradual rollout limits blast radius.
The tradeoff is speed. This pipeline is slower than "ship it if evals pass", but the alternative is deploying a model that confidently fails on inputs it never saw in training.
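To make the gap between average and tail performance concrete, here is a minimal sketch of a canary gate that checks both. It assumes each interaction has already been given a quality score in [0, 1] by some automated evaluator. The `tail_mean` and `canary_gate` helpers and the threshold values are hypothetical choices for illustration, not part of the report.

```python
import math
from statistics import mean


def tail_mean(scores: list[float], fraction: float = 0.05) -> float:
    """Mean of the worst `fraction` of scores (always at least one score)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    k = max(1, math.ceil(len(scores) * fraction))
    return mean(sorted(scores)[:k])


def canary_gate(
    scores: list[float],
    min_mean: float = 0.85,  # assumed threshold on average quality
    min_tail: float = 0.50,  # assumed threshold on the worst 5%
) -> bool:
    """Pass only if both the average and the worst-5% mean clear their bars."""
    return mean(scores) >= min_mean and tail_mean(scores) >= min_tail


# 95 good interactions and 5 bad ones: the average looks healthy,
# but the worst 5% is where users get hurt.
scores = [0.95] * 95 + [0.1] * 5
print(round(mean(scores), 4))       # 0.9075
print(round(tail_mean(scores), 4))  # 0.1
print(canary_gate(scores))          # False
```

A gate on the mean alone would promote this model. Adding the tail check is one way to let the canary stage catch the failures an average hides.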
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:37:45.590627+00:00— report_created — created