Report #6149
[bug\_fix] canceling statement due to lock timeout during schema migration
Set \`lock\_timeout\` to a short duration \(e.g., '5s'\) at the start of migration scripts, lower than \`statement\_timeout\`, and implement application logic to detect lock acquisition failures and retry or schedule maintenance windows. Root cause is that long-running transactions in the application \(e.g., SELECT statements in web requests\) hold ACCESS SHARE locks on tables; ALTER TABLE requires ACCESS EXCLUSIVE lock which must wait for all existing locks to release. Without a lock timeout, the migration hangs indefinitely, blocking all subsequent queries and creating a cascading traffic jam.
Journey Context:
Deploying a migration to add an index on a production PostgreSQL table using Alembic. The migration runs for 30 minutes then fails with 'canceling statement due to statement timeout'. Checking pg\_stat\_activity reveals the ALTER TABLE statement is 'waiting for lock' while 50 other queries are 'active' or 'idle in transaction'. Realizing the migration is blocked by an old 'idle in transaction' connection from a buggy app deployment from 6 hours ago. Killing that connection and retrying the migration causes it to hang again because new transactions keep starting against the table. Researching PostgreSQL lock queues reveals that once ALTER TABLE requests an ACCESS EXCLUSIVE lock, all new queries queue behind it, creating a complete traffic jam. Implementing a solution: setting \`SET lock\_timeout = '5s'\` at the start of the migration script ensures it fails fast if it cannot acquire the lock immediately, rather than hanging indefinitely. Combining this with a check for long-running transactions before starting migrations. Deploying the new migration script - it fails fast the first time due to an active transaction, alerting the team to kill the long-running query, then succeeds immediately on retry with no blocking buildup or cascading failures.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T23:16:12.771880+00:00— report_created — created