Report #10679
[bug\_fix] Connection pool exhaustion \(HikariCP\) causing connection acquisition timeout
Set HikariCP's \`leakDetectionThreshold\` to identify code paths not returning connections, refactor to hold connections only during actual database work \(not during external HTTP calls\), and size the pool correctly \(formula: connections = \(\(core\_count \* 2\) \+ effective\_spindle\_count\)\). The root cause is often connections being held for long durations due to slow external API calls or unclosed connections in exception paths, starving the pool.
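The sizing formula above can be written out as a small helper. This is a minimal sketch: the class and method names are hypothetical, and the inputs in the usage note are made-up hardware figures, since the report does not state them. The formula is usually read as describing the database server's cores and disks rather than the application host's.

```java
public class PoolSizing {
    /**
     * Starting-point pool size from the report's formula:
     * connections = (core_count * 2) + effective_spindle_count.
     * Treat the result as a value to load test, not a guarantee.
     */
    public static int poolSize(int coreCount, int effectiveSpindleCount) {
        if (coreCount < 1 || effectiveSpindleCount < 0) {
            throw new IllegalArgumentException(
                "coreCount must be >= 1 and effectiveSpindleCount >= 0");
        }
        return coreCount * 2 + effectiveSpindleCount;
    }
}
```

For example, a hypothetical 8-core database host with 4 effective spindles gives a pool size of 20 from this helper. In a Spring Boot service the value would typically be applied through the \`spring.datasource.hikari.maximum-pool-size\` property.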
Journey Context:
A developer deploys a Spring Boot microservice to production using HikariCP with default settings \(pool size 10\). Under a load test with 50 concurrent users, the application throws "Connection is not available, request timed out after 30000ms". The developer raises the pool size to 100 and restarts, but the error persists, and Postgres now logs "too many clients already" \(see entry 1\). Suspecting slow queries, the developer checks pg\_stat\_statements and finds an average query time of only 5ms. A thread dump of the Java application shows the request threads blocked in \`getConnection\(\)\`, waiting for connections to be returned to the pool. Code review then reveals a critical path in which a method annotated with \`@Transactional\` makes an external HTTP call to a third-party payment API that takes 2-5 seconds to respond. The transaction context holds the database connection for the entire call, monopolizing a pool slot while no database work is being done. The fix is to move the external API call outside the \`@Transactional\` boundary, so that a connection is acquired only for the brief database updates. The developer also sets \`leakDetectionThreshold=60000\` in the HikariCP configuration, which logs the stack trace of any code that borrows a connection and does not return it within 60 seconds; this quickly identifies a second leak, an exception handler that did not close its resources. Finally, the pool size is tuned with the formula based on CPU cores and disk spindles, resulting in a stable pool of 20 connections handling 1000\+ concurrent users.
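The refactor described above can be illustrated without Spring. This is a sketch under stated assumptions, not the service's actual code: \`FakeTx\` is a hypothetical stand-in for a transaction boundary that borrows a connection on entry and returns it on exit, and \`chargeCard\` is a hypothetical placeholder for the slow payment API call. The log records whether a connection was held while the HTTP call ran.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class CheckoutSketch {
    /** Hypothetical stand-in for a transaction boundary that holds one pooled connection. */
    public static class FakeTx {
        public boolean connectionHeld = false;
        public final List<String> log = new ArrayList<>();

        public <T> T inTransaction(Supplier<T> work) {
            connectionHeld = true;       // connection borrowed from the pool
            log.add("tx-begin");
            try {
                return work.get();
            } finally {
                connectionHeld = false;  // connection returned, even if work throws
                log.add("tx-end");
            }
        }
    }

    /** Hypothetical slow third-party call; logs whether a connection was pinned meanwhile. */
    public static String chargeCard(FakeTx tx, String orderId) {
        tx.log.add(tx.connectionHeld ? "http-with-connection" : "http-without-connection");
        return "receipt-" + orderId;
    }

    /** Before: the HTTP call runs inside the transaction and pins a pool slot. */
    public static String checkoutHoldingConnection(FakeTx tx, String orderId) {
        return tx.inTransaction(() -> {
            tx.log.add("db-load-order");
            String receipt = chargeCard(tx, orderId);
            tx.log.add("db-save-receipt");
            return receipt;
        });
    }

    /** After: two short transactions, with the HTTP call between them. */
    public static String checkoutReleasingConnection(FakeTx tx, String orderId) {
        tx.inTransaction(() -> { tx.log.add("db-load-order"); return null; });
        String receipt = chargeCard(tx, orderId);
        tx.inTransaction(() -> { tx.log.add("db-save-receipt"); return null; });
        return receipt;
    }
}
```

Splitting one transaction into two gives up atomicity across the HTTP call, so the second step likely needs to tolerate a payment that succeeded while the follow-up write failed. In real Spring code the transactional pieces would also need to be invoked through the Spring proxy, for instance from a separate bean, because a call to an \`@Transactional\` method from within the same class bypasses the proxy.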
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T11:20:09.186677+00:00— report_created — created