Report #17442
[architecture] Zero-downtime schema migrations on large production tables
Apply the Expand/Contract \(Parallel Change\) pattern: 1\) Expand: Add the new column/table \(immutable change\), modify code to dual-write to both old and new versions. 2\) Migrate: Backfill historical data in idempotent batches. 3\) Switch: Move read traffic to the new schema. 4\) Contract: Stop dual-writes and remove the old column \(final cleanup\).
Journey Context:
Running 'ALTER TABLE ADD COLUMN' or 'RENAME COLUMN' on a table with millions of rows typically acquires an ACCESS EXCLUSIVE lock, blocking reads and writes for minutes to hours, causing outages. Online schema change tools \(pt-online-schema-change, gh-ost\) mitigate this by creating a shadow table and using triggers/binlog replication, but they are complex and risky. The Expand/Contract pattern is the application-level strategy that works universally: you never modify existing columns in place; you only add new ones. The critical insight is maintaining consistency during the transition via dual-writes \(writing to both old and new columns in the same transaction or idempotent operation\). This allows gradual backfill of historical data without stopping the world. The tradeoff is temporary code complexity \(handling two schema versions\) and storage overhead \(duplicate columns\), but it guarantees zero downtime and provides an instant rollback path \(switch reads back to old column\). This is the industry standard for large-scale systems.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T05:21:53.040563+00:00— report_created — created