Report #12952
[bug\_fix] SSL SYSCALL error: EOF detected / Server closed connection unexpectedly
This error occurs when a network intermediary \(AWS NAT Gateway, Azure Load Balancer, Google Cloud SQL Proxy, or a corporate firewall\) silently drops a TCP connection that has exceeded its idle timeout \(e.g., AWS NAT Gateway drops connections after 350 seconds without traffic\). Neither endpoint is notified, so both client and server treat the connection as open until one of them sends data; that send then fails with an EOF or 'Connection reset by peer'. Catching the error and reconnecting is necessary but not sufficient. The fix is to keep the connection from going idle by enabling TCP keepalive: set \`tcp\_keepalives\_idle\`, \`tcp\_keepalives\_interval\`, and \`tcp\_keepalives\_count\` in postgresql.conf \(server side\), or pass the equivalent libpq connection parameters \(e.g., \`keepalives=1&keepalives\_idle=30\`\), with the idle value well below the intermediary's timeout. Alternatively, configure the connection pool \(HikariCP, SQLAlchemy, PgBouncer\) to validate connections on checkout \(\`SELECT 1\`\) and to cap connection lifetime below the idle timeout \(e.g., 300 seconds\).
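As a minimal sketch of the client-side option, the snippet below builds a libpq-style connection URI carrying the keepalive parameters named above. The host, database, and user names are hypothetical placeholders, `build_dsn` is an illustrative helper rather than part of any library, and the `keepalives_count=3` default is an assumption (the report only gives idle and interval values).

```python
from urllib.parse import urlencode

# Idle timeout quoted in this report for AWS NAT Gateway (seconds).
NAT_IDLE_TIMEOUT_S = 350


def build_dsn(host, dbname, user, idle=30, interval=10, count=3):
    """Return a libpq-style URI with TCP keepalive enabled.

    idle:     seconds of inactivity before the first keepalive probe
    interval: seconds between unanswered probes
    count:    unanswered probes before the connection is declared dead
    """
    # A probe must go out before the intermediary forgets the connection.
    if idle >= NAT_IDLE_TIMEOUT_S:
        raise ValueError("keepalives_idle must be shorter than the idle timeout")
    params = {
        "keepalives": 1,
        "keepalives_idle": idle,
        "keepalives_interval": interval,
        "keepalives_count": count,
    }
    return f"postgresql://{user}@{host}/{dbname}?{urlencode(params)}"


# Hypothetical endpoint and credentials, for illustration only.
dsn = build_dsn("db.example.internal", "appdb", "app_user")
print(dsn)
```

The resulting string can be passed to any libpq-based driver that accepts connection URIs. The guard simply refuses an idle value that would let the intermediary time out first.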
Journey Context:
You migrate your app to AWS Lambda, connecting to RDS Postgres through a NAT Gateway. Every morning, the first few requests fail with 'SSL SYSCALL error: EOF detected'; subsequent retries work. CloudWatch shows that the errors correlate with periods of inactivity \(no requests for more than 5 minutes\). You suspect that connections kept in the Lambda's execution context are being reused after they have gone stale. You research and find that AWS NAT Gateway has an idle connection timeout of 350 seconds \(≈5.8 minutes\): if no packets flow, the NAT Gateway drops the mapping. The Postgres server still thinks the connection is open, but the Lambda's next query hits a dead TCP socket. You find that libpq supports keepalive settings in the connection string, \`keepalives=1&keepalives\_idle=30&keepalives\_interval=10\`, and you update your Lambda's DSN accordingly. You also set \`max\_conn\_lifetime\` in your connection pool to 300 seconds \(5 minutes\), so that connections are cycled before the NAT Gateway can drop them. You deploy and monitor; the morning EOF errors disappear. You document that any cloud environment with NAT or load balancers in the path must use TCP keepalive or connection cycling.
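The connection-cycling step can be sketched without any particular pool library. The class below is a hypothetical illustration, not the API of any real pool: it holds one connection and replaces it once it is older than a maximum lifetime, mirroring the 300-second setting described above. The `connect` factory and the injectable `clock` are assumptions made for the example.

```python
import time


class CyclingConnection:
    """Hold one connection and replace it once it exceeds max_lifetime_s."""

    def __init__(self, connect, max_lifetime_s=300.0, clock=time.monotonic):
        self._connect = connect            # hypothetical factory returning a new connection
        self._max_lifetime_s = max_lifetime_s
        self._clock = clock                # injectable so the age check can be tested
        self._conn = None
        self._created_at = 0.0

    def get(self):
        """Return a connection younger than max_lifetime_s, reconnecting if needed."""
        now = self._clock()
        if self._conn is not None and now - self._created_at >= self._max_lifetime_s:
            try:
                self._conn.close()
            except Exception:
                pass  # the socket may already be dead; discard it either way
            self._conn = None
        if self._conn is None:
            self._conn = self._connect()
            self._created_at = now
        return self._conn
```

A caller would construct the holder once with a factory that opens a database connection and call `get()` before each query. Because a connection never lives past 300 seconds, it can never have been idle for the full 350-second timeout.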
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T17:22:04.400993+00:00— report_created — created