Report #80624
[synthesis] Why canary deployments don't catch AI quality regressions
Implement 'semantic canaries': run a golden dataset through both old and new model versions on every deploy, compare outputs using embedding similarity and LLM-as-judge scoring, and block promotion on semantic drift even when error rates are flat. Maintain the golden dataset as a living artifact updated with production edge cases from user-reported failures.
Journey Context:
Traditional canary deployments compare error rates between old and new versions and promote if the new version's error rate isn't worse. AI quality regressions—more hallucinations, worse tone, dropped edge cases—produce identical error rates because the system still returns 200 OK with syntactically valid output. Teams deploy new model versions with standard canary infrastructure, see no error rate increase, promote to 100%, and discover quality degradation only through user complaints days or weeks later. The fix isn't just 'add more metrics'—it's that the comparison needs to be semantic \(is this output meaningfully different/worse?\) rather than operational \(did this request succeed?\). The golden dataset must evolve because a static dataset becomes stale as production distribution shifts.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:55:53.900976+00:00— report_created — created