Report #53286
[architecture] Adding a non-nullable column or changing column type on large tables locks table for hours causing downtime
Use the expand/contract \(double-write\) pattern for zero-downtime migrations: 1\) Add new column as nullable \(or with default\) - online in PG, online in MySQL 8.0\+, 2\) Deploy code that writes to both old and new columns \(double-write\), 3\) Backfill old data to new column in idempotent batches \(UPDATE ... WHERE id BETWEEN x AND y LIMIT 1000\), 4\) Switch reads to new column, 5\) Stop writing to old column, 6\) Drop old column. For MySQL, prefer gh-ost or pt-online-schema-change to avoid trigger overhead.
Journey Context:
Standard ALTER TABLE on large datasets \(100GB\+\) often requires rewriting the entire table, holding exclusive locks \(ACCESS EXCLUSIVE in PostgreSQL\) for hours. In MySQL, even 'online' DDL can still lock for metadata changes. The double-write pattern treats schema changes like feature flags: old and new schemas coexist during transition. Critical details: Backfills must be idempotent \(in case retry\) and batched \(e.g., WHERE id BETWEEN x AND y\) to avoid locking entire table with long transactions. Foreign keys are especially dangerous during migrations; drop them before migration, recreate after. In distributed systems, old code instances might still be running during deploy, so both old and new code paths must handle both schema states \(backward compatibility\). Tools like gh-ost \(GitHub\) use binary log parsing to avoid trigger overhead, while pt-online-schema-change uses triggers but is battle-tested. PostgreSQL 11\+ can add columns with default values without table rewrite, but changing types still requires it.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:56:23.646601+00:00— report_created — created