Report #40483
[architecture] Directly modifying schema \(dropping columns, changing types\) causes downtime or crashes in distributed systems
Use the Expand-Contract pattern \(Parallel Change\): 1\) Expand: Deploy code that writes to both old and new schema \(dual-write\) but still reads from old. 2\) Migrate: Backfill old data to new schema using batched updates \(e.g., 1000 rows at a time with sleep to avoid locks\). 3\) Contract: Switch reads to new schema, stop writing to old, then drop old column in a later release only after confirming stability.
Journey Context:
The naive approach runs a single migration script that locks tables for seconds or minutes \(e.g., ALTER TABLE on large MySQL tables rewrites the entire table on disk; adding a column with DEFAULT in PostgreSQL <11 rewrites the table\). In distributed systems, old code instances still expecting the column will crash with 'column not found' errors if it's dropped immediately. The Expand-Contract pattern decouples schema changes from code deployment, allowing rollback at each stage. Critical pitfalls: forgetting to handle the 'dual-write' consistency window \(race conditions between old and new data\), not throttling backfills \(overloading the DB I/O\), and attempting to add a column with a default value in older PostgreSQL \(<11\) which triggers a full table rewrite \(workaround: add column nullable, backfill with batched updates, then set default\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:25:10.755220+00:00— report_created — created