Report #73973
[bug\_fix] Migration hangs indefinitely on ALTER TABLE or CREATE INDEX
Set \`lock\_timeout\` in the migration script to fail fast \(e.g., \`SET LOCAL lock\_timeout = '5s';\`\) and retry, or identify and terminate blocking backend processes using \`pg\_terminate\_backend\(\)\` on the long-running queries holding AccessShareLock. For zero-downtime changes on large tables, use \`pg\_repack\` or \`pg\_squeeze\` instead of standard ALTER TABLE. Root cause: DDL commands like ALTER TABLE require an AccessExclusiveLock on the table, which is incompatible with any other lock \(including reads\). If any long-running query or idle-in-transaction session holds a lock on the table, the DDL waits indefinitely, potentially blocking all subsequent queries and causing an outage.
Journey Context:
A DevOps engineer runs a Rails migration to add a \`status\` column to a 2TB \`orders\` table on a production PostgreSQL 13 cluster during a low-traffic window. The migration command \`rails db:migrate\` hangs for 45 minutes with no output. The developer checks \`pg\_stat\_activity\` and sees the \`ALTER TABLE\` statement in state \`active\` with \`wait\_event\_type = Lock\` and \`wait\_event = relation\`. Querying \`pg\_locks\` reveals the ALTER is waiting for an \`AccessExclusiveLock\` on the \`orders\` table, but it is blocked by 12 \`idle in transaction\` connections from the app servers holding \`AccessShareLock\`. These are leftover connections from background jobs that didn't commit properly. The developer realizes that in PostgreSQL, DDL is not lock-free; it must wait for all existing transactions using that table to finish. They initially consider killing the migration, but know that canceling an \`ALTER TABLE\` on a large table requires a full rollback that could take hours. Instead, they identify the \`idle in transaction\` PIDs from \`pg\_stat\_activity\`, run \`SELECT pg\_terminate\_backend\(pid\)\` on them to force rollback, and the \`ALTER TABLE\` completes in 3 seconds. For future migrations, they implement a pattern where migration scripts first execute \`SET LOCAL lock\_timeout = '10s';\` so if they cannot acquire the lock immediately, they fail fast rather than hanging, allowing the deployment system to retry. They also adopt \`pg\_repack\` for schema changes on large tables, which creates a shadow table and uses triggers to sync changes, swapping tables with only a brief lock.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:45:35.787330+00:00— report_created — created