Report #96660
[gotcha] Tool calls fail with cryptic errors after MCP server process crashes
Implement health checks or heartbeat monitoring for MCP server processes. On tool call failure, check if the server process is still alive before retrying. Implement automatic reconnection with re-initialization. Surface clear 'server unavailable' errors rather than raw transport errors.
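A minimal sketch of the liveness check and clear error described above, assuming a stdio server launched as a child process. `SupervisedServer`, `ServerUnavailableError`, and the `send_request` callback are hypothetical names for illustration, not part of any MCP SDK. A real client would also repeat its initialization handshake after each restart.

```python
import subprocess


class ServerUnavailableError(RuntimeError):
    """Raised instead of a raw transport error when the server process is gone."""


class SupervisedServer:
    """Hypothetical wrapper around a stdio server process (not an MCP SDK class)."""

    def __init__(self, name, argv, max_restarts=1):
        self.name = name
        self.argv = argv
        self.max_restarts = max_restarts
        self.proc = None
        self.start()

    def start(self):
        # Spawn the server. A real client would re-run its initialize
        # handshake here before accepting tool calls again.
        self.proc = subprocess.Popen(
            self.argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE
        )

    def is_alive(self):
        # Popen.poll() returns None while the process is still running.
        return self.proc is not None and self.proc.poll() is None

    def call(self, send_request):
        """Run send_request(proc); turn failures caused by a dead server into a clear error."""
        restarts = 0
        while True:
            try:
                return send_request(self.proc)
            except (OSError, EOFError) as exc:
                if self.is_alive():
                    raise  # the server is up, so this is a different problem
                if restarts >= self.max_restarts:
                    raise ServerUnavailableError(
                        f"MCP server '{self.name}' is unavailable "
                        f"(exit code {self.proc.returncode})"
                    ) from exc
                restarts += 1
                self.start()  # reconnect once, then retry the request
```

The retry budget is deliberately small: a server that crashes again immediately after a restart is more usefully reported as unavailable than restarted in a loop.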
Journey Context:
MCP servers are separate processes that communicate with the client over stdio or SSE. When a server crashes (OOM, unhandled exception, timeout), the client's transport layer breaks, but the tool definitions from that server remain in the LLM's context. The agent calls a tool, gets a cryptic transport error, and tries to interpret it, often by retrying or switching to alternative tools. The agent cannot recognize that the server process died, because that concept doesn't exist in its tool-calling model. The fix requires the client to detect server death, remove the dead server's tools from context, and either reconnect or clearly inform the agent.
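One way to sketch the "remove the dead server's tools from context" step is a small client-side registry that tracks which server owns each tool. `ToolRegistry` and its method names are assumptions made for this example. How the visible tool list is actually passed to the model depends on the client.

```python
from dataclasses import dataclass, field


@dataclass
class ToolRegistry:
    """Hypothetical client-side map of tool name -> owning server name."""

    tools: dict = field(default_factory=dict)
    dead_servers: set = field(default_factory=set)

    def register(self, server, tool_names):
        # Called after a (re)connect: replace the server's old tool list.
        self.tools = {n: s for n, s in self.tools.items() if s != server}
        for name in tool_names:
            self.tools[name] = server
        self.dead_servers.discard(server)

    def mark_dead(self, server):
        # Called when the client detects that the server process has exited.
        self.dead_servers.add(server)

    def visible_tools(self):
        """Tool names to include in the next model request."""
        return sorted(
            n for n, s in self.tools.items() if s not in self.dead_servers
        )

    def explain_unavailable(self, tool_name):
        """Return a plain-language tool result if the owning server is dead, else None."""
        server = self.tools.get(tool_name)
        if server in self.dead_servers:
            return (
                f"Tool '{tool_name}' is unavailable: server '{server}' has stopped. "
                "Do not retry; use another approach or tell the user."
            )
        return None
```

Returning the explanation as the tool result covers the case where the agent calls a tool it saw in an earlier turn, before the definitions were removed.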
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:49:47.278508+00:00— report_created — created