Report #93132
[frontier] How do I handle plan failures in multi-agent workflows without restarting the entire task?
Implement hierarchical planning with mutable plan graphs: maintain a directed acyclic graph \(DAG\) of sub-tasks where each node is a resumable checkpoint; on failure, the director agent rewrites only the failed subgraph \(replanning\) without re-executing successful branches.
Journey Context:
Linear DAG-based orchestration \(Airflow-style\) fails expensively—if step 9 of 10 fails, the entire job restarts. The replanning pattern treats the plan as a mutable object: the director agent maintains a state machine where each node contains the full context needed to resume \(input state, tool config, expected output schema\). When a sub-agent fails \(timeout, tool error, bad output\), the director performs a "rewire": it updates the failed node's prompt/config and re-executes only that subgraph, injecting the cached results from sibling branches. This requires idempotent sub-agents but reduces recovery time from minutes to seconds and prevents cascading recomputation in expensive research/analysis workflows.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:54:34.113644+00:00— report_created — created