Report #43082
[gotcha] MCP stdio server process crashes and agent retries tool calls indefinitely with cryptic transport errors
Implement health-check heartbeats for MCP server processes. Wrap tool calls in bounded retry-with-reconnect logic that checks whether the server process is still alive before retrying. On tool call failure, distinguish transport errors (server dead) from tool errors (bad arguments): only the former are worth a reconnect and retry. Log server lifecycle events (start, exit code, restart). Consider SSE transport for better disconnection detection.
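A minimal sketch of the retry-with-reconnect idea follows. It is not tied to any MCP SDK: `TransportError`, `ToolError`, `call_with_reconnect`, and the three callbacks are hypothetical names introduced for illustration, and a real client would need to map its own exceptions onto these two classes.

```python
import time


class TransportError(Exception):
    """Hypothetical: the pipe to the server failed (the server may be dead)."""


class ToolError(Exception):
    """Hypothetical: the server answered, but the tool rejected the call."""


def call_with_reconnect(send, is_alive, reconnect, max_attempts=3, backoff_s=0.0):
    """Run send() with a bounded number of attempts.

    send()      performs one tool call; raises TransportError or ToolError
    is_alive()  returns True if the server process is still running
    reconnect() restarts the server and re-initializes the session
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except ToolError:
            # Bad arguments or a tool-side failure: retrying will not help.
            raise
        except TransportError as exc:
            last_error = exc
            if attempt == max_attempts:
                break
            if not is_alive():
                # The server is gone: restart it instead of retrying blindly.
                reconnect()
            time.sleep(backoff_s * attempt)  # simple linear backoff
    raise TransportError(f"giving up after {max_attempts} attempts") from last_error
```

The attempt cap is the important design choice here: once it is reached, the agent receives a definite failure it can report instead of looping on the same dead pipe.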
Journey Context:
The stdio transport pipes through stdin/stdout. If the server process crashes (OOM, unhandled exception, killed by OS), the client may not immediately detect the disconnection. Subsequent `tools/call` requests fail with generic transport errors that don't clearly indicate 'the server is dead.' The agent interprets this as a transient error and retries indefinitely. The MCP spec defines transport-level error handling but most client implementations don't distinguish 'server gone' from 'server busy,' leading to infinite retry loops.
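For a client that spawns the server itself, one way to tell 'server gone' from 'server busy' is to ask the operating system rather than infer it from the error text. The sketch below uses only Python's standard `subprocess` module. The helper names `describe_server_state` and `read_response_line` are hypothetical, and it assumes the server was started with `stdout=subprocess.PIPE` and `text=True`.

```python
import subprocess
import sys


def describe_server_state(proc: subprocess.Popen) -> str:
    """Classify a child process as running, killed, or exited."""
    code = proc.poll()  # None while the process is still running
    if code is None:
        return "running"
    if code < 0:
        # On POSIX, a negative return code means "killed by signal -code".
        return f"killed by signal {-code}"
    return f"exited with code {code}"


def read_response_line(proc: subprocess.Popen) -> str:
    """Read one line from the server's stdout, treating EOF as 'server gone'."""
    line = proc.stdout.readline()
    if line == "":
        # EOF: the child closed stdout, which usually means it has exited.
        # wait() reaps it so poll() reports the final return code; it raises
        # subprocess.TimeoutExpired if the process is somehow still running.
        proc.wait(timeout=5)
        raise ConnectionError(
            f"server closed stdout ({describe_server_state(proc)})"
        )
    return line


if __name__ == "__main__":
    # Stand-in for a crashing server: a child that exits immediately.
    crashed = subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.exit(3)"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        read_response_line(crashed)
    except ConnectionError as err:
        print(err)
```

Raising a distinct error that carries the exit status gives the retry layer something concrete to act on, and the same status is worth writing to the lifecycle log.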
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:47:04.623107+00:00— report_created — created