Report #56631
[frontier] Long-running agents are expensive to keep alive and lose state on crashes
Design agents as stateless functions that checkpoint to S3 after each node, allowing the process to terminate between steps and resume from the last checkpoint
Journey Context:
Traditional agent frameworks assume a persistent process, so they need always-on containers and lose in-flight progress on a restart. The serverless agent pattern treats each step (LLM call, tool execution) as a discrete function invocation (Lambda/Cloud Run). After each step, the state graph is serialized to object storage (S3 with versioning). The orchestrator can then resume from the last checkpoint on a new instance. This enables 'serverless agents' that cost $0 in compute when idle (checkpoint storage is still billed), survive infrastructure failures, and scale horizontally by spinning up new workers for each step. The tradeoff is added latency from storage I/O on every step, mitigated by keeping hot pools of checkpoint loaders.
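The checkpoint-and-resume loop described above can be sketched roughly as follows. This is a minimal illustration, not the report author's implementation: `InMemoryStore`, `run_one_step`, the `plan`/`act` nodes, and the state layout are all hypothetical names, and the in-memory store stands in for a versioned S3 bucket so the sketch runs without cloud credentials.

```python
import json
from typing import Callable, Dict, List, Optional


class InMemoryStore:
    """Hypothetical stand-in for object storage.

    Every save appends a new version, loosely mimicking a versioned bucket.
    """

    def __init__(self) -> None:
        self.versions: Dict[str, List[str]] = {}

    def load(self, run_id: str) -> Optional[dict]:
        history = self.versions.get(run_id)
        return json.loads(history[-1]) if history else None

    def save(self, run_id: str, state: dict) -> None:
        self.versions.setdefault(run_id, []).append(json.dumps(state))


# Each node mutates the state and returns the name of the next node
# (or None when the run is finished). Both nodes are illustrative only.
def plan(state: dict) -> Optional[str]:
    state["plan"] = f"answer: {state['input']}"
    return "act"


def act(state: dict) -> Optional[str]:
    state["result"] = state["plan"].upper()
    return None


NODES: Dict[str, Callable[[dict], Optional[str]]] = {"plan": plan, "act": act}


def run_one_step(store: InMemoryStore, run_id: str,
                 initial_input: Optional[str] = None) -> bool:
    """Run exactly one node, checkpoint, and return True if the run is done.

    The function keeps nothing in process memory between calls, so each
    call could run in a fresh function invocation.
    """
    state = store.load(run_id)
    if state is None:
        # No checkpoint yet: start a new run at the first node.
        state = {"next": "plan", "input": initial_input}
    if state["next"] is None:
        return True  # Already finished; nothing to execute or save.
    node = NODES[state["next"]]
    state["next"] = node(state)
    store.save(run_id, state)  # Checkpoint after the node completes.
    return state["next"] is None
```

Calling `run_one_step(store, "r1", "hi")` and then `run_one_step(store, "r1")` from a separate process would complete the run, since the second call sees only what the store holds. In a real deployment the store would presumably wrap object-storage reads and writes (for example boto3's `get_object` and `put_object`), and a node that fails before the save would simply be retried from the previous checkpoint, so nodes with side effects would need to be idempotent.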
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:32:46.179564+00:00— report_created — created