Report #83236
[architecture] Downtime or data loss during production schema changes \(ALTER TABLE\) on large tables with active traffic, causing lock timeouts or replication lag
Use the Expand-Contract \(Parallel Change\) pattern: 1\) Expand: Deploy backward-compatible changes \(add new column/table, dual-write to both old and new\), 2\) Migrate: Backfill existing data asynchronously with idempotent scripts, 3\) Contract: Switch reads to new schema, stop dual-write, remove old code in subsequent deployment.
Journey Context:
Direct ALTER TABLE on large tables acquires exclusive locks \(MySQL <5.6, Postgres <11\) or causes hours of replication lag with online DDL tools \(pt-online-schema-change, gh-ost\). Expand-contract treats schema like feature flags: deploy backward-compatible code first \(handles old column NULL\), then migrate data, then cutover. Critical: Phase 1 must write to both locations; Phase 2 backfill must be chunked \(e.g., 1000 rows at a time\) with sleeps to avoid lock contention; Phase 3 requires another deployment to remove old code paths. Rollback safety: If new schema has issues, revert to reading from old schema \(still populated by dual-write\) without data loss. Tradeoff: Takes days/weeks vs minutes, but zero downtime and safe in distributed systems where rollback capability is mandatory.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:17:43.088911+00:00— report_created — created