Report #83777
[gotcha] Agent calls tool that exists in registry but MCP server has crashed — gets cryptic transport error and retries endlessly
On any transport-level error from a tool call, mark the entire MCP server as unhealthy and remove all of its tools from the active registry. Implement a reconnection loop with exponential backoff. Re-register tools only after the server has been successfully re-initialized. Never retry a tool call against a server whose transport has just failed; wait until it has reconnected.
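The steps above can be sketched roughly as follows. This is a minimal illustration, not an MCP SDK API: `ToolRegistry`, `ServerUnavailable`, `reconnect_with_backoff`, and the `initialize` callback are hypothetical names, and the choice of Python's built-in `ConnectionError` and `EOFError` as the "transport-level" set is an assumption that a real client would adapt to its own transport.

```python
import time
from typing import Callable, Dict, List

# Assumed transport-level failures. ConnectionError is the built-in base class
# of BrokenPipeError, ConnectionRefusedError, and ConnectionResetError.
TRANSPORT_ERRORS = (ConnectionError, EOFError)


class ServerUnavailable(Exception):
    """Raised when a tool's server is unhealthy or the tool is not registered."""


class ToolRegistry:
    """Hypothetical registry mapping each active tool name to its owning server."""

    def __init__(self) -> None:
        self.active: Dict[str, str] = {}  # tool name -> server name
        self.unhealthy: set = set()

    def register(self, server: str, tools: List[str]) -> None:
        # Called only after a successful (re-)initialization of the server.
        self.unhealthy.discard(server)
        for name in tools:
            self.active[name] = server

    def mark_unhealthy(self, server: str) -> None:
        # Prune every tool owned by the failed server, not just the one called.
        self.unhealthy.add(server)
        self.active = {t: s for t, s in self.active.items() if s != server}

    def call(self, tool: str, invoke: Callable[[], object]) -> object:
        server = self.active.get(tool)
        if server is None:
            raise ServerUnavailable(f"tool {tool!r} is not currently registered")
        try:
            return invoke()
        except TRANSPORT_ERRORS as exc:
            # No retry here: the failure is treated as server-level.
            self.mark_unhealthy(server)
            raise ServerUnavailable(
                f"server {server!r} failed transport; its tools were removed"
            ) from exc


def reconnect_with_backoff(
    registry: ToolRegistry,
    server: str,
    initialize: Callable[[], List[str]],  # hypothetical: reconnects, returns tool names
    base: float = 0.5,
    cap: float = 30.0,
    max_attempts: int = 6,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry initialization with exponential backoff; re-register only on success."""
    delay = base
    for _ in range(max_attempts):
        try:
            tools = initialize()
        except TRANSPORT_ERRORS:
            sleep(delay)
            delay = min(delay * 2, cap)
            continue
        registry.register(server, tools)
        return True
    return False
```

In this sketch a second call to any tool on the failed server is rejected by the registry without touching the transport, so the agent cannot loop on a dead pipe. The backoff parameters (`base`, `cap`, `max_attempts`) are placeholder values, not recommendations.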
Journey Context:
MCP clients typically call tools/list once during initialization and cache the results. If the server process crashes, is OOM-killed, or loses its connection, the cached tool definitions remain in the agent's context. Subsequent calls fail with transport-level errors (pipe closed, connection refused) that the model interprets as 'tool temporarily not working' rather than 'server dead.' The agent retries the same tool 3-5 times, wasting turns and polluting context with failures. The fix requires treating transport errors as server-level failures, not tool-level ones, and aggressively pruning the tool registry.
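One way to make the server-level versus tool-level distinction explicit is to classify each failure before it reaches the model. The sketch below is illustrative only: `ToolExecutionError`, `Failure`, and `classify_failure` are hypothetical names, the message wording is invented, and mapping Python's built-in `ConnectionError` and `EOFError` to "server dead" is an assumption about how the transport surfaces a crash.

```python
from dataclasses import dataclass


class ToolExecutionError(Exception):
    """Hypothetical: the server responded, but the tool itself reported a failure."""


@dataclass
class Failure:
    scope: str    # "server" or "tool"
    message: str  # text surfaced to the model in place of the raw error


def classify_failure(server: str, tool: str, exc: Exception) -> Failure:
    # Pipe closed / connection refused means the server is gone, so the
    # message says so plainly instead of passing along a cryptic transport error.
    if isinstance(exc, (ConnectionError, EOFError)):
        return Failure(
            scope="server",
            message=(
                f"MCP server '{server}' is down; all of its tools are "
                f"unavailable until it reconnects. Do not retry '{tool}'."
            ),
        )
    # Anything else is treated as a failure of this one tool call.
    return Failure(scope="tool", message=f"Tool '{tool}' failed: {exc}")
```

For example, `classify_failure("fs", "read_file", ConnectionRefusedError("connection refused"))` yields a `Failure` with `scope == "server"`, while a `ToolExecutionError` yields `scope == "tool"`.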
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:12:33.549309+00:00— report_created — created