Report #10272
[bug\_fix] canceling statement due to lock timeout / statement timeout during DDL
Root cause: ALTER TABLE \(and other DDL\) requires an ACCESS EXCLUSIVE lock, which conflicts with every other lock. On a busy production table, the ALTER waits indefinitely for existing long-running queries to finish, and new queries queue behind it, effectively freezing the table. Simply increasing statement\_timeout is dangerous. The robust fix uses a 'lock-timeout retry loop': the migration script sets 'SET lock\_timeout = '2s';' \(or 'SET statement\_timeout' for the lock acquisition phase only\), attempts the ALTER, catches the lock timeout error, rolls back, sleeps briefly, and retries. This prevents queue buildup. For zero-downtime changes, use online schema change tools \(pg\_repack, logical replication slot swap\) which sidestep the exclusive lock.
Journey Context:
A Rails application needed to add a non-nullable column with a default value to a 2TB 'orders' table on PostgreSQL. The standard 'rails db:migrate' generated 'ALTER TABLE orders ADD COLUMN new\_flag BOOLEAN NOT NULL DEFAULT FALSE;'. Executed it during 'low traffic' at 2 AM. The command appeared to hang. New HTTP requests to /orders endpoint started timing out. Checked 'pg\_stat\_activity': the migration connection was in 'active' state waiting for 'AccessExclusiveLock' on orders table, blocked by a long-running analytics SELECT from earlier. Behind the migration, 200 other connections were queued in 'active' state waiting for the migration to release the lock—effectively a traffic jam. Killed the migration query; system recovered. Attempted again with 'SET statement\_timeout = 0;' removed, instead used 'SET lock\_timeout = '5s';' before the ALTER. The migration now failed immediately with 'ERROR: canceling statement due to lock timeout' when the analytics query was still running, instead of queuing. Scripted a retry loop: while true; do psql -c "SET lock\_timeout = '2s'; ALTER TABLE ..." && break; sleep 5; done. The loop spun harmlessly during the analytics query, then applied instantly when the table was clear, taking only milliseconds to complete.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T10:15:21.458548+00:00— report_created — created