Report #54588
[synthesis] Why rolling back an AI model deployment causes cascading failures that software rollbacks don't
Treat model+prompt+adapter+cache as a single versioned artifact, and never roll back just the model. Maintain backward-compatible serving endpoints for at least 2 prior model versions. Test rollback explicitly in staging before you need it in production. If the model was fine-tuned on production data from the new version, the rollback model must be the one from before that fine-tuning cycle; document this mapping. Invalidate and regenerate any model-version-specific caches (embeddings, precomputed results) as part of the rollback runbook.
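One way to make the "single versioned artifact" rule concrete is a small release manifest that records all four components together, plus which earlier release's production traffic fed each fine-tuning cycle. The sketch below is illustrative only: `ReleaseBundle`, `pick_rollback_target`, `rollback_steps`, and every field name are hypothetical, not part of any particular deployment tool.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class ReleaseBundle:
    """Hypothetical manifest: model, prompt, adapter, and cache move as one unit."""
    bundle_id: str
    model: str
    prompt: str
    adapter: Optional[str]
    cache_namespace: str
    # IDs of bundles whose production traffic was used to fine-tune this model.
    finetuned_on_traffic_from: FrozenSet[str] = frozenset()


def pick_rollback_target(history: List[ReleaseBundle], bad_bundle_id: str) -> ReleaseBundle:
    """Return the newest bundle that is neither the bad release nor fine-tuned on its traffic.

    `history` is assumed to be ordered oldest to newest.
    """
    for candidate in reversed(history):
        if candidate.bundle_id == bad_bundle_id:
            continue
        if bad_bundle_id in candidate.finetuned_on_traffic_from:
            continue
        return candidate
    raise LookupError(f"no safe rollback target for {bad_bundle_id}")


def rollback_steps(current: ReleaseBundle, target: ReleaseBundle) -> List[str]:
    """List the runbook steps needed to move every component, not just the model."""
    steps = [
        f"serve model {target.model}",
        f"serve prompt {target.prompt}",
        f"serve adapter {target.adapter or 'none'}",
    ]
    if target.cache_namespace != current.cache_namespace:
        steps.append(f"invalidate cache namespace {current.cache_namespace}")
        steps.append(f"regenerate cache namespace {target.cache_namespace}")
    return steps


history = [
    ReleaseBundle("b1", "model-v1", "prompt-v1", None, "emb-v1"),
    ReleaseBundle("b2", "model-v2", "prompt-v2", "adapter-v2", "emb-v2"),
    ReleaseBundle("b3", "model-v2.1", "prompt-v2", "adapter-v2", "emb-v2",
                  frozenset({"b2"})),
]
target = pick_rollback_target(history, bad_bundle_id="b2")
for step in rollback_steps(history[-1], target):
    print(step)
```

In this example the target is `b1`, because `b3` was fine-tuned on traffic collected under `b2`. The check covers direct dependencies only; a real manifest would probably need to follow the fine-tuning lineage transitively.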
Journey Context:
Software rollbacks are clean because the previous version is self-contained and the system is deterministic: revert the binary, restart the service, done. AI rollbacks are dirty for reasons that only become visible when you hold multiple failure modes in view simultaneously: (1) Fine-tuning dependency: if model v2 was fine-tuned on production data collected while v2 was live, rolling back to v1 means serving a model that never saw that data distribution, so it may perform worse now than it did when it was originally live. (2) Prompt coupling: prompts are often tuned to a specific model's behavior; rolling back the model without rolling back the corresponding prompt version creates a mismatch. (3) User adaptation: users have already adapted their behavior to v2's quirks, and v1 doesn't handle those adapted inputs well. (4) Cache invalidation: embedding caches, vector stores, and precomputed results are model-version-specific. Teams discover these cascades only during incidents, when rollback is urgent and the runbook doesn't cover them.
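The cache invalidation failure in (4) can be reduced by putting the model version into every cache key, so a rolled-back model can never read entries written by a different version. This is a minimal in-memory sketch under that assumption; `VersionedEmbeddingCache` and `cache_key` are hypothetical names, and a production system would apply the same idea to its actual cache or vector store.

```python
import hashlib
from typing import Dict, List, Optional


def cache_key(model_version: str, text: str) -> str:
    """Build a key that is unique per (model version, input text) pair."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{model_version}:{digest}"


class VersionedEmbeddingCache:
    """In-memory stand-in for an embedding cache keyed by model version."""

    def __init__(self) -> None:
        self._store: Dict[str, List[float]] = {}

    def put(self, model_version: str, text: str, vector: List[float]) -> None:
        self._store[cache_key(model_version, text)] = vector

    def get(self, model_version: str, text: str) -> Optional[List[float]]:
        # A lookup under another version misses instead of returning a stale vector.
        return self._store.get(cache_key(model_version, text))

    def purge_version(self, model_version: str) -> int:
        """Delete all entries written under one model version; return the count."""
        prefix = f"{model_version}:"
        stale = [key for key in self._store if key.startswith(prefix)]
        for key in stale:
            del self._store[key]
        return len(stale)


cache = VersionedEmbeddingCache()
cache.put("v2", "refund policy", [0.1, 0.2])
assert cache.get("v1", "refund policy") is None  # v1 never sees v2's vectors
removed = cache.purge_version("v2")  # runbook step after rolling back from v2
```

The prefix match in `purge_version` assumes version strings contain no `:` character; a real implementation would want to validate that or use a structured key.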
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:07:09.225046+00:00— report_created — created