Report #14918
[bug\_fix] Postgres replication slot causing primary disk fill
Identify the stale replication slot in pg\_replication\_slots and drop it on the primary by running SELECT pg\_drop\_replication\_slot\('slot\_name'\); \(the drop succeeds only while the slot is inactive\). To prevent recurrence, set max\_slot\_wal\_keep\_size \(available in PostgreSQL 13 and later; the default of -1 means unlimited\) to a limit that fits the volume \(e.g., 100GB\) so that a stale slot cannot consume unlimited disk, and monitor pg\_replication\_slots for inactive slots. Root cause: a replication slot makes the primary retain every WAL segment that the slot's consumer has not yet received. If a standby disconnects permanently and its slot is not removed, the primary keeps all WAL from the slot's restart\_lsn forward indefinitely, filling the pg\_wal directory.
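The monitoring step could look something like the query below. It is a sketch, not a verified check: it assumes it runs on the primary \(pg\_current\_wal\_lsn\(\) is not available during recovery\) and that the server is PostgreSQL 13 or later, since the wal\_status and safe\_wal\_size columns do not exist in older versions. The retained\_wal alias is just an illustrative name.

```sql
-- List every replication slot together with the amount of WAL it forces
-- the primary to keep, largest first.
SELECT slot_name,
       slot_type,
       active,
       restart_lsn,
       -- Bytes between the current write position and the slot's restart_lsn.
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
       wal_status,     -- PostgreSQL 13+: reserved, extended, unreserved, or lost
       safe_wal_size   -- PostgreSQL 13+: bytes left before the slot risks invalidation
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
```

A slot with active = false and a steadily growing retained\_wal value is the pattern to alert on. The threshold is a local decision and is not prescribed by this report.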
Journey Context:
A DevOps engineer receives disk space alerts for a production PostgreSQL primary. The pg\_wal directory is consuming 800GB and growing, even though archive\_command is successfully copying files to S3. Normally old WAL is removed or recycled after a checkpoint, but pg\_ls\_waldir shows thousands of old files. The engineer checks pg\_replication\_slots and finds a slot named 'standby\_old\_region' with active=false and a restart\_lsn pointing at WAL from three weeks ago. The slot was left behind by a disaster recovery drill: the drill created a standby in a different region, the standby was terminated two weeks ago, and the automation failed to drop its slot on the primary. Because the slot guarantees that WAL from that position onward stays available for the standby, the primary cannot remove any WAL files newer than that LSN. The engineer drops the slot with SELECT pg\_drop\_replication\_slot\('standby\_old\_region'\); and the next checkpoint removes the retained WAL, freeing about 800GB. To prevent future incidents, they set max\_slot\_wal\_keep\_size = 100GB in postgresql.conf so that a stale slot can retain at most 100GB of WAL before it is forcibly invalidated.
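The remediation described above can be sketched as the following sequence, run on the primary as a superuser. Two parts are assumptions rather than what the report states: the explicit CHECKPOINT, which is optional and only avoids waiting for the next scheduled checkpoint, and the use of ALTER SYSTEM, which writes to postgresql.auto.conf where the report edited postgresql.conf directly. Either route should take effect after a configuration reload.

```sql
-- 1. Confirm the slot is inactive; dropping a slot that is in use fails.
SELECT slot_name, active, active_pid, restart_lsn
FROM pg_replication_slots
WHERE slot_name = 'standby_old_region';

-- 2. Drop the stale slot so the primary stops retaining WAL for it.
SELECT pg_drop_replication_slot('standby_old_region');

-- 3. Optional: request a checkpoint so the no-longer-needed WAL is removed now.
CHECKPOINT;

-- 4. Cap the WAL any slot may retain (PostgreSQL 13+), then reload the config.
ALTER SYSTEM SET max_slot_wal_keep_size = '100GB';
SELECT pg_reload_conf();

-- 5. Verify the new limit is in effect.
SHOW max_slot_wal_keep_size;
```

With the limit in place, a slot that falls more than 100GB behind is invalidated and its standby has to be rebuilt, so the value should leave room for the longest standby outage that is expected to recover on its own.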
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T22:45:25.338841+00:00— report_created — created