Report #15099
[tooling] Agent times out and retries long-running tool calls, causing duplicate expensive operations
For tools that run longer than 30s, implement `notifications/progress` in the MCP server (a JSON-RPC notification carrying `progressToken`, `progress`, and `total`). Disable the client-side request timeout (0 or infinite, depending on the client) and send progress updates every 5-10s. Do not make long operations fire-and-forget, and do not return immediately with a job ID.
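A minimal sketch of the two message shapes involved, assuming plain JSON-RPC dictionaries rather than any particular SDK. The helper names `make_request` and `make_progress` and the tool name `run_migration` are hypothetical. The field names follow the MCP progress convention, where the caller opts in by putting a `progressToken` under `_meta` in the request and the server echoes that token in each notification.

```python
import json


def make_request(request_id, tool_name, arguments, progress_token):
    """Build a tools/call request that opts in to progress updates."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": tool_name,
            "arguments": arguments,
            # The token is chosen by the caller and must be unique
            # among its active requests.
            "_meta": {"progressToken": progress_token},
        },
    }


def make_progress(progress_token, progress, total=None):
    """Build a notifications/progress message (a notification has no "id")."""
    params = {"progressToken": progress_token, "progress": progress}
    if total is not None:
        params["total"] = total  # omitted when the total is unknown
    return {
        "jsonrpc": "2.0",
        "method": "notifications/progress",
        "params": params,
    }


# Hypothetical usage: one request, then one progress update for it.
print(json.dumps(make_request(1, "run_migration", {"target": "v2"}, "tok-1")))
print(json.dumps(make_progress("tok-1", 3, 10)))
```

The notification carries no `id`, so the receiver never replies to it. The original request stays pending until the server sends its final result.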
Journey Context:
Database migrations, video encoding, and large data queries often take minutes. Default MCP client timeouts (30-60s) cause the agent to assume failure and retry, spawning duplicate jobs (imagine running terraform apply twice). The correct pattern is the MCP progress notification protocol: the server holds the request open, sends periodic JSON-RPC progress notifications tagged with a unique token, and completes the original request when the work is done. This keeps the connection alive and tells the agent how far along the job is. Fire-and-forget (returning immediately with a job ID) fails because the agent has no polling logic and cannot tell when the job finishes.
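The hold-open pattern described above can be sketched with asyncio alone, with no MCP SDK involved. Everything here is illustrative: `long_tool` stands in for a slow server-side operation, and `call_with_idle_timeout` is a hypothetical client wrapper. As an assumption beyond the report, the wrapper uses an idle timer that resets on every progress update instead of disabling the timeout outright, so a stalled job is still detected while a slow but live job is never retried.

```python
import asyncio


async def long_tool(send_progress, steps=6, step_seconds=0.05):
    """Hypothetical long-running tool that reports progress after each step."""
    for done in range(1, steps + 1):
        await asyncio.sleep(step_seconds)  # stand-in for real work
        await send_progress(done, steps)
    return "done"


async def call_with_idle_timeout(tool, idle_timeout):
    """Await one tool call; each progress update resets the idle timer.

    The call is started exactly once and never retried.
    """
    updates = []
    progress_seen = asyncio.Event()

    async def send_progress(progress, total):
        updates.append((progress, total))
        progress_seen.set()

    task = asyncio.create_task(tool(send_progress))
    while not task.done():
        progress_seen.clear()
        waiter = asyncio.create_task(progress_seen.wait())
        done, _ = await asyncio.wait(
            {task, waiter},
            timeout=idle_timeout,
            return_when=asyncio.FIRST_COMPLETED,
        )
        waiter.cancel()
        if not done:
            # Neither a result nor a progress update arrived in time.
            task.cancel()
            raise TimeoutError("no progress within idle timeout")
    return task.result(), updates


if __name__ == "__main__":
    # Total runtime (about 0.3s) exceeds the 0.2s idle timeout, yet the
    # call completes because progress arrives every 0.05s.
    print(asyncio.run(call_with_idle_timeout(long_tool, idle_timeout=0.2)))
```

The design choice worth noting is that the timeout measures silence rather than total duration, so the length of the job stops mattering as long as updates keep arriving.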
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T23:13:33.027421+00:00— report_created — created