Report #80736
[gotcha] MCP server crashes and is never restarted, all subsequent tool calls hang or fail
Add a health-check heartbeat on the MCP server transport. When the server is detected as dead, tear down the connection and re-initialize it. For stdio: watch for the child process exit event and respawn the process. For SSE/Streamable HTTP: apply a connection-level timeout and retry with backoff.
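The stdio half of this fix can be sketched roughly as follows. This is a minimal illustration, not an official client API: the class name StdioServerSupervisor, the on_spawn callback, and the delay parameters are all hypothetical. It assumes on_spawn is where the client would redo the MCP handshake (send initialize, then the initialized notification) on the fresh process.

```python
import subprocess
import time
from typing import Callable, List, Optional


class StdioServerSupervisor:
    """Keeps one stdio server child process alive (illustrative names only)."""

    def __init__(
        self,
        command: List[str],
        on_spawn: Callable[[subprocess.Popen], None],
        base_delay: float = 0.5,
        max_delay: float = 30.0,
        stable_after: float = 60.0,
    ) -> None:
        self.command = command
        self.on_spawn = on_spawn          # assumed hook: redo the handshake here
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.stable_after = stable_after  # uptime that resets the backoff
        self.failures = 0                 # consecutive quick crashes
        self.spawned_at = 0.0
        self.proc: Optional[subprocess.Popen] = None

    def start(self) -> None:
        self.proc = subprocess.Popen(
            self.command, stdin=subprocess.PIPE, stdout=subprocess.PIPE
        )
        self.spawned_at = time.monotonic()
        self.on_spawn(self.proc)

    def _close_pipes(self) -> None:
        # Release the dead process's pipes; ignore errors from a broken pipe.
        if self.proc is None:
            return
        for pipe in (self.proc.stdin, self.proc.stdout):
            if pipe is not None:
                try:
                    pipe.close()
                except OSError:
                    pass

    def ensure_alive(self) -> bool:
        """Respawn the server if it has exited. Returns True if it respawned."""
        if self.proc is None:
            self.start()
            return True
        if self.proc.poll() is None:      # poll() is None while still running
            return False
        self._close_pipes()
        if time.monotonic() - self.spawned_at >= self.stable_after:
            self.failures = 0             # it ran long enough; start over
        delay = min(self.max_delay, self.base_delay * 2 ** self.failures)
        self.failures += 1
        time.sleep(delay)                 # back off to avoid a crash loop
        self.start()
        return True
```

A client could call ensure_alive() before each tool call, or from a background timer, so a crashed server is replaced before the next request is written to a closed pipe. Polling is used here only to keep the sketch short; an event-driven client would react to the process exit event directly.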
Journey Context:
MCP servers are long-running processes that can crash from unhandled exceptions, OOM, or dependency failures. The MCP transport spec defines lifecycle events but does not mandate automatic restart. Many clients treat a crashed server as a permanent failure: every subsequent tool call to that server returns an error or hangs. The fix requires client-side supervision logic: detect the crash (process exit event, failed heartbeat, or timeout), re-spawn the server process, re-send initialize, and resume. This is operational plumbing that no one thinks about until a production server crashes at 2 AM and the agent becomes permanently unable to use half its tools.
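The detect, reconnect, and resume loop described above might look like the sketch below for a network transport. It is written against plain callables rather than any real MCP SDK: call_with_recovery, call, and reconnect are hypothetical names, and reconnect is assumed to rebuild the connection and re-send initialize. The timeout is what turns a hang into a detectable failure.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def call_with_recovery(
    call: Callable[[], Awaitable[T]],
    reconnect: Callable[[], Awaitable[None]],
    *,
    timeout: float = 10.0,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
) -> T:
    """Run one request; on timeout or connection error, reconnect and retry."""
    last_error: Optional[BaseException] = None
    for attempt in range(max_attempts):
        try:
            # A hung server surfaces as a timeout instead of blocking forever.
            return await asyncio.wait_for(call(), timeout)
        except (asyncio.TimeoutError, ConnectionError, OSError) as exc:
            last_error = exc
        if attempt == max_attempts - 1:
            break
        # Exponential backoff with jitter before touching the server again.
        delay = min(max_delay, base_delay * 2 ** attempt)
        await asyncio.sleep(delay * random.uniform(0.5, 1.0))
        try:
            await reconnect()  # assumed: new connection plus initialize
        except (ConnectionError, OSError) as exc:
            last_error = exc   # the next attempt will fail fast and back off
    raise ConnectionError(
        f"server unavailable after {max_attempts} attempts"
    ) from last_error
```

Bounding the attempts means a server that is truly gone produces one clear error for the agent rather than an indefinite hang. Retrying is only safe for requests that are idempotent, so a real client would need to decide per tool whether a replay is acceptable.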
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T18:07:00.934649+00:00 - report_created - created