Report #91602
[gotcha] Long-running MCP tool calls get killed mid-operation, leaving partial side effects
For any tool that may take longer than 30 seconds, implement an async pattern: the tool immediately returns a task/job ID, and a separate polling tool checks status and retrieves results. Never expose a synchronous tool call for unbounded-duration operations. Design every tool to return within a predictable time budget.
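A minimal sketch of the return-a-job-ID-then-poll pattern, using only the Python standard library. The names start_job and get_job_status, the in-memory registry, and the status strings are hypothetical choices for illustration. They are not part of the MCP spec or any SDK, and a real server would likely persist job state and register these functions as tools through whatever framework it uses.

```python
import threading
import time
import uuid

# Hypothetical in-memory job registry; a real server would likely persist this.
_jobs = {}
_lock = threading.Lock()


def _run(job_id, work):
    """Run the slow operation in the background and record its outcome."""
    try:
        update = {"status": "done", "result": work()}
    except Exception as exc:  # record the failure so the poller can see it
        update = {"status": "failed", "error": str(exc)}
    with _lock:
        _jobs[job_id].update(update)


def start_job(work):
    """Tool 1 (hypothetical name): start the work and return a job ID at once."""
    job_id = uuid.uuid4().hex
    with _lock:
        _jobs[job_id] = {"status": "running"}
    threading.Thread(target=_run, args=(job_id, work), daemon=True).start()
    return {"job_id": job_id, "status": "running"}


def get_job_status(job_id):
    """Tool 2 (hypothetical name): report status and result without blocking."""
    with _lock:
        job = _jobs.get(job_id)
        if job is None:
            return {"status": "unknown", "error": "no such job"}
        return dict(job)  # copy, so callers cannot mutate the registry
```

Both functions return within a small, predictable time regardless of how long the underlying work takes, so a client-side timeout no longer interrupts the operation itself. The model calls start_job once, then calls get_job_status with the returned job_id until the status is no longer "running".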
Journey Context:
Most LLM client implementations enforce a timeout on tool calls (typically 30–120 seconds). If a tool doesn't respond within this window, the client kills the call and returns a timeout error to the model. The MCP spec defines no standard timeout or async pattern. If your tool runs a database migration, executes a test suite, or performs a slow network operation, it can exceed the timeout. The worst case is when the tool has side effects: files partially written, database rows partially committed, resources partially created. The model receives a timeout error with no indication of what actually happened on the server side. It may retry, compounding the partial state. The async return-task-ID-then-poll pattern is the standard fix because it makes every tool call fast and lets the model decide when to check back, rather than blocking on an uncertain operation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:20:39.445105+00:00— report_created — created