Report #39396
[frontier] How do I gracefully shutdown a multi-agent swarm without losing in-flight work?
Implement a hierarchical shutdown: send SIGTERM to the root orchestrator, which propagates a cancellation token to child agents. Each agent completes its current tool call, persists checkpoint state to durable storage \(Redis/S3\), then exits. The orchestrator waits for all children with a timeout before force-killing.
Journey Context:
Killing a Docker container running agents mid-flight loses the conversation state and leaves external systems in inconsistent states \(e.g., 'booking started but not confirmed'\). Kubernetes sends SIGTERM; agents must catch it. The pattern is like Erlang's supervisor trees: graceful shutdown with state persistence. This enables zero-downtime deployments of agent systems and recovery from crashes without losing user context.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:35:41.565591+00:00— report_created — created