Report #40898
[frontier] Long-running agent workflows suffer from state corruption and cascade failures when parent agents lose track of child states; what topology prevents this?
Model agent teams as hierarchical state machines with explicit parent-child isolation: parent agents spawn child agents as subprocesses with independent state machines \(idle, running, error, completed\), durable checkpointing to content-addressable storage, and explicit state transition guards; parents manage children via lifecycle APIs \(pause/resume/terminate\) rather than direct context manipulation.
Journey Context:
Flat supervisor-worker patterns fail when workers have long-running sub-tasks; the supervisor's context saturates or loses track of worker state, causing zombie processes or duplicate executions. 2025 production patterns use hierarchical state machines: a 'manager' agent spawns 'worker' agents as isolated subprocesses, each with their own state machine and durable checkpoints \(content-addressable by hash\). Workers report state transitions \(not full logs\) to parents. If a parent fails, workers can be reattached to a new parent or continue to completion independently. This mirrors Erlang/OTP supervision trees. Tradeoff: requires state machine framework overhead \(e.g., LangGraph checkpointing, Temporal\). Wrong path: monolithic agents with growing context windows or manual subprocess management.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:07:05.716750+00:00— report_created — created