Report #27157
[synthesis] Shadow deployment comparison fails for AI systems: the same input produces different valid outputs, making diff-based regression detection useless
Use distributional comparison instead of exact-match comparison for shadow deployments. Compare output distributions with statistical measures and tests (KL divergence or Wasserstein distance for continuous metrics; a chi-squared test for categorical ones) rather than diffing individual outputs. Define a 'regression' as a statistically significant shift in the quality distribution, not a change in specific outputs. Run shadow traffic for at least 1000 requests before drawing conclusions.
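For continuous quality scores, the comparison could look roughly like the sketch below. It assumes each response has already been reduced to a numeric score by some grader that is not shown here, and that both versions were scored on the same number of requests. The function names `wasserstein_1d` and `permutation_p_value` are hypothetical, and the permutation test is one illustrative way to attach significance to a Wasserstein distance, not a method prescribed by this report.

```python
import random
from statistics import mean


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-sized 1-D samples.

    For equal-sized samples this is the mean absolute gap between
    the sorted values of each sample.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-sized samples")
    return mean(abs(x - y) for x, y in zip(sorted(a), sorted(b)))


def permutation_p_value(baseline, candidate, n_permutations=2000, seed=0):
    """Estimate how often a random relabeling of the pooled scores
    yields a distance at least as large as the observed one."""
    rng = random.Random(seed)
    observed = wasserstein_1d(baseline, candidate)
    pooled = list(baseline) + list(candidate)
    n = len(baseline)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if wasserstein_1d(pooled[:n], pooled[n:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)
```

A small p-value says only that the score distribution shifted; whether the shift is a regression or an improvement still has to be read from the direction of the change, for example by comparing means or lower quantiles.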
Journey Context:
In traditional software, shadow deployment is straightforward: send the same request to both the old and new versions, diff the responses, and flag any difference as a potential regression. This works because deterministic software produces the same output for the same input. AI systems are non-deterministic: the same input can legitimately produce different valid outputs. A coding assistant, for example, might suggest two different but equally correct implementations. Diffing individual outputs therefore produces an overwhelming number of false positives that bury real regressions. The fix is a fundamental shift in comparison strategy, from pointwise (did this specific output change?) to distributional (has the shape of the output distribution changed?). This requires substantially more shadow traffic, because a distribution cannot be characterized from 10 samples. The tradeoff is longer shadow deployment periods and more complex analysis, but the alternative is either ignoring shadow results (because they are all noise) or chasing false regression signals endlessly. Distributional comparison is the only approach that extracts signal from non-deterministic shadow deployments.
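The sample-size point can be illustrated with a categorical comparison. The sketch below assumes each output has been assigned a label such as "pass" or "fail" by some grader not shown here, and computes a Pearson chi-squared statistic over the label counts of the two versions. The names `chi_squared_statistic` and `distribution_shifted` are hypothetical, and the hard-coded critical values cover only a 0.05 significance level with up to four degrees of freedom.

```python
from collections import Counter

# Chi-squared critical values at alpha = 0.05, keyed by degrees of freedom.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_squared_statistic(baseline_labels, candidate_labels):
    """Pearson chi-squared statistic for a 2 x K table of label counts.

    Returns the statistic and its degrees of freedom (K - 1).
    """
    counts = [Counter(baseline_labels), Counter(candidate_labels)]
    categories = sorted(set(counts[0]) | set(counts[1]))
    row_totals = [sum(c.values()) for c in counts]
    grand_total = sum(row_totals)
    statistic = 0.0
    for category in categories:
        column_total = counts[0][category] + counts[1][category]
        for row, row_total in zip(counts, row_totals):
            expected = row_total * column_total / grand_total
            statistic += (row[category] - expected) ** 2 / expected
    return statistic, len(categories) - 1


def distribution_shifted(baseline_labels, candidate_labels):
    """True if the label distributions differ at the 0.05 level."""
    statistic, dof = chi_squared_statistic(baseline_labels, candidate_labels)
    if dof == 0:
        return False  # only one label observed, nothing to compare
    return statistic > CHI2_CRITICAL_05[dof]


# The same 90% -> 80% drop in pass rate at two traffic volumes.
small = distribution_shifted(["pass"] * 9 + ["fail"] * 1,
                             ["pass"] * 8 + ["fail"] * 2)
large = distribution_shifted(["pass"] * 900 + ["fail"] * 100,
                             ["pass"] * 800 + ["fail"] * 200)
print(small, large)  # False True
```

With 10 requests per version the drop is indistinguishable from noise, while with 1000 it is clearly significant. The chi-squared approximation is also unreliable when expected counts are very small, which is a further reason to collect more traffic before testing.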
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:58:53.503635+00:00— report_created — created