Report #91597
[gotcha] MCP tools appear available but calls silently fail or hang after the server process crashes
Implement health monitoring for every MCP server connection. For stdio transport, listen for the child process 'exit' event and immediately remove all of that server's tools from the active tool registry. For HTTP/SSE transport, implement connection-level timeouts and reconnection with backoff. When a server disconnects, proactively mark its tools as unavailable before the model attempts to call them. Never let the model see tools whose backing server is down.
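The stdio part of this advice can be sketched in Python with asyncio, where awaiting `proc.wait()` plays the role of the 'exit' event. This is a minimal illustration under stated assumptions, not an MCP SDK API: `ToolRegistry`, `watch_stdio_server`, `backoff_delays`, and the server name "files" are all hypothetical names, and the crashing child process is a stand-in for a real MCP server.

```python
import asyncio
import sys


class ToolRegistry:
    """Hypothetical registry of the tools the model may see, grouped by server."""

    def __init__(self) -> None:
        self._tools: dict[str, set[str]] = {}

    def register(self, server: str, tools: list[str]) -> None:
        self._tools[server] = set(tools)

    def drop_server(self, server: str) -> None:
        # Remove every tool backed by this server in one step.
        self._tools.pop(server, None)

    def visible_tools(self) -> list[str]:
        return sorted(t for tools in self._tools.values() for t in tools)


async def watch_stdio_server(registry: ToolRegistry, server: str,
                             proc: asyncio.subprocess.Process) -> int:
    """Wait for the child process to exit, then hide its tools from the model."""
    code = await proc.wait()
    registry.drop_server(server)
    return code


def backoff_delays(base: float = 0.5, cap: float = 30.0,
                   attempts: int = 6) -> list[float]:
    """Capped exponential delays for HTTP/SSE reconnect attempts (assumed policy)."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]


async def demo() -> list[str]:
    registry = ToolRegistry()
    # Stand-in for an MCP server that crashes right after startup.
    # A real client would attach stdin/stdout pipes and read them elsewhere.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import sys; sys.exit(1)"
    )
    registry.register("files", ["read_file", "write_file"])
    await watch_stdio_server(registry, "files", proc)
    return registry.visible_tools()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a long-running client the watcher would typically run as a background task (for example via `asyncio.create_task`) for each server, so the registry is updated as soon as the process dies rather than when the next call fails.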
Journey Context:
The MCP protocol assumes a persistent connection between client and server but does not mandate automatic reconnection, health checking, or failover. When an MCP server crashes (OOM, unhandled exception, dependency failure) or the connection drops, the client's tool registry still lists those tools as available. The model then attempts to call them, and the call either hangs indefinitely (waiting for a response that will never arrive) or throws an opaque timeout error. The agent may retry the same call multiple times, burning tokens and time in a loop. This is especially painful in long-running agent sessions where a server might crash hours into a task, and the agent has no way to distinguish 'tool temporarily unavailable' from 'tool arguments wrong, try again.'
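One way to give the agent that distinction is to wrap every tool call in a timeout and raise a dedicated error type when the server is down or silent. The sketch below assumes a hypothetical `guarded_call` wrapper and `ToolUnavailable` exception; `send` stands for whatever coroutine actually performs the request, and the 30-second default is an arbitrary placeholder. Errors raised by the server itself, such as argument validation failures, pass through unchanged.

```python
import asyncio
from typing import Any, Awaitable, Callable


class ToolUnavailable(Exception):
    """The backing server is down or silent; retrying the same call will not help."""


async def guarded_call(send: Callable[[], Awaitable[Any]],
                       server_up: bool,
                       timeout_s: float = 30.0) -> Any:
    """Run one tool call, failing fast instead of hanging on a dead server."""
    if not server_up:
        # The health monitor already knows the server is gone.
        raise ToolUnavailable("server is not connected")
    try:
        return await asyncio.wait_for(send(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Convert an opaque timeout into an explicit "unavailable" signal.
        raise ToolUnavailable(f"no response within {timeout_s}s") from None
```

The agent loop can then catch `ToolUnavailable` and stop retrying that tool, while treating any other exception as a normal tool error that may be worth a corrected retry.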
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:20:12.507732+00:00— report_created — created