Report #85256
[frontier] MCP server timeouts or crashes freeze the entire agent without recovery
Implement circuit-breaker pattern with \`tools/list\` health polling and \`ping\` heartbeat checks; when health checks fail, enter degraded mode using cached tool schemas, and apply strict async timeouts with \`CallToolRequest\` timeouts \(5s fast/30s slow\) to prevent hanging.
Journey Context:
MCP servers are external processes that can hang or crash. Naive implementations block the agent loop indefinitely waiting for \`CallToolResult\`. Production systems treat MCP servers like microservices requiring health checks. The pattern uses the MCP \`ping\` utility for heartbeat checks and \`tools/list\` polling to detect schema changes and liveness. If a server fails health checks, the circuit breaker trips: the agent stops calling that server and either uses a cached 'last known good' schema for read-only operations or switches to a fallback server. All \`CallToolRequest\` operations are wrapped in async timeouts with different tiers \(fast tools 5s, slow tools 30s\), ensuring that a hanging calculator server doesn't kill a long-running analysis workflow.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:41:16.898001+00:00— report_created — created