Report #15548
[gotcha] MCP stdio server process becomes a zombie: agent hangs on tool calls to the dead server
Implement health checks and process monitoring for stdio MCP servers. Set a timeout on every tool call (e.g., 30-60 seconds) so that a read can never block forever. Detect that the child process has exited by polling its exit status (for example, a non-blocking waitpid on its PID) or by handling SIGCHLD, and restart the server process automatically on failure. Always treat EPIPE on writes and EOF on reads as signs that the server is gone.
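The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference client: the class name `MonitoredStdioServer`, the `ServerDied` exception, and the `call_with_restart` helper are hypothetical, and the sketch assumes one newline-delimited JSON message per request with a single request in flight at a time (no JSON-RPC id matching).

```python
import json
import queue
import subprocess
import threading


class ServerDied(Exception):
    """The child exited, closed its stdout, or rejected a write."""


class MonitoredStdioServer:
    """Hypothetical wrapper around a line-delimited JSON stdio server."""

    def __init__(self, argv, timeout=30.0):
        self.argv = argv
        self.timeout = timeout  # seconds to wait for each response
        self._start()

    def _start(self):
        self.proc = subprocess.Popen(
            self.argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE
        )
        self.lines = queue.Queue()
        # Reader thread: forwards each stdout line; None marks EOF.
        threading.Thread(
            target=self._pump, args=(self.proc.stdout, self.lines), daemon=True
        ).start()

    @staticmethod
    def _pump(stream, out):
        for line in stream:
            out.put(line)
        out.put(None)  # EOF: the child closed stdout or exited

    def alive(self):
        # poll() returns None while the child runs; it also reaps an exited child.
        return self.proc.poll() is None

    def restart(self):
        try:
            self.proc.stdin.close()
        except OSError:
            pass  # unflushed bytes for a dead child; nothing left to do
        if self.alive():
            self.proc.kill()
        self.proc.wait()
        self._start()

    def call(self, request):
        if not self.alive():
            raise ServerDied(f"server exited with code {self.proc.returncode}")
        try:
            self.proc.stdin.write(json.dumps(request).encode() + b"\n")
            self.proc.stdin.flush()
        except OSError as exc:  # includes BrokenPipeError (EPIPE)
            raise ServerDied(f"write failed: {exc}") from exc
        try:
            line = self.lines.get(timeout=self.timeout)
        except queue.Empty:
            raise TimeoutError(f"no response within {self.timeout}s") from None
        if line is None:
            raise ServerDied("EOF on server stdout")
        return json.loads(line)

    def call_with_restart(self, request):
        # One restart and one retry; only safe when the request is idempotent.
        try:
            return self.call(request)
        except (ServerDied, TimeoutError):
            self.restart()
            return self.call(request)
```

A caller would construct it with the server's command line, for example `MonitoredStdioServer(["my-mcp-server"], timeout=30.0)` (a placeholder command), and route tool calls through `call` or `call_with_restart`. Retrying automatically is a design choice: for tool calls with side effects it may be better to surface `ServerDied` to the agent and let it decide.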
Journey Context:
With stdio transport, the MCP server runs as a child process communicating over stdin/stdout. If the server process crashes, gets OOM-killed, or exits unexpectedly, the client's writes to its stdin will eventually fail with EPIPE (or raise SIGPIPE, if the client has not ignored that signal) and reads from its stdout will return EOF. But many client implementations don't handle these cases gracefully: they may block indefinitely on a read, waiting for a response that will never come, or buffer writes without checking whether the process is still alive. The result is an agent that hangs on every tool call to the dead server with no error message. This is especially common in long-running agent sessions, where the server process may be killed by the OS hours after startup. The stdio transport itself doesn't define any heartbeat or keepalive mechanism, so a client that neither monitors the child process nor puts a timeout on reads only learns that the peer is dead when a read or write fails.
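To make the failure mode concrete, here is a small Python probe of what the client's pipe ends look like once the child is gone. It is an illustrative sketch under assumptions: the function name `probe_dead_child` is hypothetical, a child that exits immediately stands in for a crashed server, and the exact exception type on the write side is platform-dependent (`BrokenPipeError` on POSIX, where Python ignores SIGPIPE by default).

```python
import subprocess
import sys


def probe_dead_child():
    """Show how a dead stdio peer looks from the parent's side."""
    # Stand-in for a crashed server: a child that exits immediately.
    # bufsize=0 makes writes hit the pipe directly; with buffering,
    # the write error would only appear at flush().
    proc = subprocess.Popen(
        [sys.executable, "-c", "pass"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        bufsize=0,
    )
    proc.wait()  # the child is gone; only the parent's pipe ends remain
    try:
        # Read side: EOF is an empty bytes object, not an exception.
        saw_eof = proc.stdout.readline() == b""

        # Write side: the kernel reports EPIPE, surfaced as an OSError.
        try:
            proc.stdin.write(b"{}\n")  # payload content is irrelevant here
            write_error = None
        except OSError as exc:
            write_error = exc
    finally:
        proc.stdin.close()
        proc.stdout.close()
    return proc.returncode, saw_eof, write_error
```

Neither signal shows up unless the client checks for it: a loop that ignores the empty read, or a caller that never looks at the write error, is how the silent hang described above can arise.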
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T00:23:20.186332+00:00— report_created — created