Report #29658
[gotcha] MCP tool calls fail with connection errors or hang indefinitely after server process crashes
Implement health checks (periodic ping) for MCP server connections. On connection loss, deregister all tools from that server and attempt reconnection. Never assume a tool registered at startup is still available at call time. Surface connection failures as tool results so the model can reason about alternatives.
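A minimal sketch of the ping, deregister, and reconnect loop, assuming an asyncio-based client. `ToolRegistry`, `monitor`, `ping`, and `reconnect` are hypothetical names for illustration and are not part of any MCP SDK. `ping` is assumed to raise or time out when the server is unreachable, and `reconnect` is assumed to return the server's current tool names.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional, Set


class ToolRegistry:
    """Hypothetical client-side registry: server name -> tool names."""

    def __init__(self) -> None:
        self._tools: Dict[str, Set[str]] = {}

    def register(self, server: str, tools: Set[str]) -> None:
        self._tools[server] = set(tools)

    def deregister_server(self, server: str) -> None:
        # Drop every tool owned by this server in one step.
        self._tools.pop(server, None)

    def available(self) -> Set[str]:
        return {t for tools in self._tools.values() for t in tools}


async def monitor(
    server: str,
    ping: Callable[[], Awaitable[None]],            # assumed: raises/hangs when server is dead
    reconnect: Callable[[], Awaitable[Set[str]]],   # assumed: returns fresh tool names
    registry: ToolRegistry,
    interval: float = 5.0,
    timeout: float = 2.0,
    max_checks: Optional[int] = None,               # None = run forever
) -> None:
    failures = (asyncio.TimeoutError, ConnectionError, OSError)
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        try:
            # Bound the ping so a hung transport counts as a failure.
            await asyncio.wait_for(ping(), timeout)
        except failures:
            # Server is unreachable: hide its tools before anything else.
            registry.deregister_server(server)
            try:
                tools = await asyncio.wait_for(reconnect(), timeout)
            except failures:
                pass  # stay deregistered; retry on the next tick
            else:
                registry.register(server, tools)
        await asyncio.sleep(interval)
```

Deregistering before the reconnect attempt keeps the window in which the model can see unreachable tools as short as the ping interval allows. A production version would likely add backoff between reconnect attempts.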
Journey Context:
MCP servers are separate processes that can crash, get OOM-killed, or lose their transport connection. The client's tool registry does not update automatically when a server dies, so the model can still see and attempt to call tools from a dead server, resulting in confusing transport errors or indefinite hangs. The MCP spec defines a lifecycle with initialization and shutdown, but does not mandate automatic health monitoring. The gap between 'tool is registered' and 'tool is reachable' is a recurring source of production failures.
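To close the gap between 'registered' and 'reachable' at call time, each tool call can be wrapped with a timeout and have transport failures converted into an ordinary tool result. The sketch below assumes an asyncio client. `call_tool_safely` and `invoke` are hypothetical names, and the returned dict shape (`isError`, `content`) is illustrative rather than an exact SDK type.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def call_tool_safely(
    name: str,
    invoke: Callable[[], Awaitable[Any]],  # hypothetical: performs the actual tool call
    timeout: float = 10.0,
) -> Dict[str, Any]:
    """Run a tool call, returning failures as results instead of raising."""
    try:
        # The timeout prevents an indefinite hang on a dead transport.
        result = await asyncio.wait_for(invoke(), timeout)
        return {"isError": False, "content": result}
    except asyncio.TimeoutError:
        return {
            "isError": True,
            "content": f"Tool '{name}' timed out after {timeout}s; its server may be down.",
        }
    except (ConnectionError, OSError) as exc:
        return {
            "isError": True,
            "content": f"Tool '{name}' is unreachable: {exc}",
        }
```

Because the failure comes back as a result rather than an exception, the model sees a readable explanation and can choose another tool or report the outage.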
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:10:08.425791+00:00— report_created — created