Report #95341
[synthesis] Why do AI product behaviors change between deployments with no code changes in the application?
Add prompt regression testing as a mandatory CI/CD gate. Maintain a golden dataset of prompt→expected-output-distribution pairs. On every deployment, including model swaps with zero code changes, run the golden dataset and compare the new output distributions against the expected ones (a chi-squared test for categorical outputs, KL divergence for continuous distributions). Block deployment if distributions diverge beyond configured thresholds.
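The comparison step can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the additive smoothing constant, and the thresholds are all assumptions introduced here, and continuous outputs are assumed to have been binned into a histogram before the KL divergence is computed.

```python
import math


def chi_squared_statistic(baseline_counts, candidate_counts):
    """Pearson chi-squared statistic for a 2 x k table (baseline run vs candidate run)."""
    n_base = sum(baseline_counts.values())
    n_cand = sum(candidate_counts.values())
    if n_base == 0 or n_cand == 0:
        raise ValueError("both runs need at least one labelled output")
    total = n_base + n_cand
    stat = 0.0
    for category in set(baseline_counts) | set(candidate_counts):
        base = baseline_counts.get(category, 0)
        cand = candidate_counts.get(category, 0)
        column_total = base + cand
        if column_total == 0:
            continue  # category never observed in either run
        for observed, row_total in ((base, n_base), (cand, n_cand)):
            expected = row_total * column_total / total
            stat += (observed - expected) ** 2 / expected
    return stat


def kl_divergence(baseline_hist, candidate_hist, alpha=1.0):
    """KL(baseline || candidate) in nats over histogram bins.

    alpha is an additive smoothing constant (an assumption of this sketch)
    so that empty bins do not produce a division by zero.
    """
    bins = set(baseline_hist) | set(candidate_hist)
    if not bins:
        raise ValueError("histograms are empty")
    n_base = sum(baseline_hist.values()) + alpha * len(bins)
    n_cand = sum(candidate_hist.values()) + alpha * len(bins)
    kl = 0.0
    for b in bins:
        p = (baseline_hist.get(b, 0) + alpha) / n_base
        q = (candidate_hist.get(b, 0) + alpha) / n_cand
        kl += p * math.log(p / q)
    return kl


def should_block(chi2_stat, chi2_critical, kl, kl_max):
    """Block the deployment if either measure exceeds its configured threshold."""
    return chi2_stat > chi2_critical or kl > kl_max
```

With two output categories the chi-squared statistic has one degree of freedom, so a threshold of 3.841 corresponds to a 0.05 significance level. The KL threshold has no such standard value and would need to be calibrated against run-to-run noise on an unchanged model.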
Journey Context:
Traditional CI/CD pipelines test code: unit tests, integration tests, E2E tests. These pass when the code hasn't changed, giving confidence in deployment. But AI products have a hidden dependency: the model. When the model is updated provider-side (no code change on your end), the same code with the same prompts produces different outputs. Traditional CI gives a false green light because the code hasn't changed. The synthesis of CI/CD pipeline design with LLM evaluation methodology reveals that AI products need a new CI gate with no equivalent in traditional software: prompt regression testing. This tests not the code, but the code+model combination, and must run on every model change, not just every code change. The operational challenge is that this gate requires maintaining a living evaluation dataset that evolves with the product: static golden datasets rot as product requirements change.
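One way to make the gate key on the code+model combination instead of the code alone is to fingerprint both together and rerun the golden dataset whenever the fingerprint changes. The sketch below is hypothetical: `gate_fingerprint`, `needs_regression_run`, and the idea of storing the last passing fingerprint are assumptions made for illustration, and it presumes the provider exposes a model version identifier. If the provider updates a model behind an unchanged identifier, the fingerprint will not change, so a scheduled run of the same gate would still be needed.

```python
import hashlib
import json


def gate_fingerprint(code_revision, model_id, prompt_templates):
    """Hash everything that determines model output: code, model, and prompts."""
    payload = json.dumps(
        {"code": code_revision, "model": model_id, "prompts": prompt_templates},
        sort_keys=True,  # stable ordering so equal inputs hash identically
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_regression_run(last_passed_fingerprint, code_revision, model_id, prompt_templates):
    """True when the code+model+prompt combination has not passed the gate yet."""
    current = gate_fingerprint(code_revision, model_id, prompt_templates)
    return current != last_passed_fingerprint
```

A model swap with an identical code revision then yields a different fingerprint, so the gate runs even though a code-only pipeline would report nothing to test.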
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:36:29.262625+00:00— report_created — created