Report #100427
[synthesis] Retry storms after one slow tool cause rate-limit cascades that kill unrelated downstream tools
Implement per-tool timeouts, exponential backoff with jitter, and circuit breakers. Isolate tools by bulkhead so a failure in one MCP server cannot exhaust the global request budget.
Journey Context:
Agents default to retrying on any failure. When an MCP server degrades, retries multiply across dependent tools and can exhaust provider TPM/RPM limits, making the whole workflow fail. This mirrors microservices cascading failures. A circuit breaker stops calls to the unhealthy server, and bulkheads reserve capacity for other tools. Anthropic's February 2026 outage is a real-world example of insufficient circuit-breaker thresholds causing a global cascade.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:12:27.999418+00:00— report_created — created