Report #45087
[frontier] How do I recover from partial failures in multi-step agent workflows without leaving external systems in inconsistent states?
Implement the Saga pattern for agent tool execution using Temporal.io or LangGraph's checkpointing. For every tool that modifies external state (POST/PUT/DELETE), register a compensating transaction (a deterministic undo function) that runs if a subsequent step fails. This does not give true atomicity: intermediate states are briefly visible to other systems, but a failed workflow ends with its external side effects reversed rather than left half-applied.
Journey Context:
Agents executing real-world actions (booking flights, transferring money, provisioning infrastructure) often fail mid-workflow due to hallucinated parameters, tool timeouts, or safety guardrail triggers. Plain retry logic leaves systems half-modified: for example, money is deducted but the flight is not booked, which creates manual reconciliation work. The distributed systems community solved this with Sagas (compensating transactions), but agent frameworks only recently added support because LLM-based 'undo' logic is unreliable: asking the LLM to 'undo the previous action' may produce a hallucinated action. The 2025 pattern requires deterministic compensations: before executing 'charge_customer', register 'refund_customer' as a compensation with pre-captured parameters (transaction_id). Registering first matters because a timed-out charge may still have gone through, so the compensation must be idempotent and safe to run even if the charge never happened. If the agent fails 3 steps later, the workflow's failure handler runs the registered compensations, typically in reverse order; in Temporal this runs as ordinary workflow code, so it is durable and retried like any other step. This requires restructuring agents to be workflow-driven (Temporal/LangGraph) rather than simple loops.
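The register-then-execute flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Temporal's or LangGraph's API: the `Saga` class, the in-memory `ledger` dict, `run_booking`, and the tool bodies are all hypothetical stand-ins, and `book_flight` always fails so that the compensation path is exercised. It also assumes the transaction_id can be generated client-side, so the refund can be registered before the charge is attempted.

```python
import uuid
from typing import Callable, Dict, List, Tuple


class Saga:
    """Hypothetical helper: stores undo callables, runs them newest-first."""

    def __init__(self) -> None:
        self._compensations: List[Tuple[str, Callable[[], None]]] = []

    def register(self, name: str, undo: Callable[[], None]) -> None:
        self._compensations.append((name, undo))

    def compensate(self) -> List[str]:
        """Run every registered undo in reverse order and return a log."""
        log: List[str] = []
        while self._compensations:
            name, undo = self._compensations.pop()
            try:
                undo()
                log.append(name)
            except Exception as exc:
                # Keep going: one failed undo must not block the others.
                log.append(f"{name} FAILED: {exc}")
        return log


def charge_customer(ledger: Dict[str, int], transaction_id: str, amount: int) -> None:
    # Stand-in for a real payment call (assumption).
    ledger[transaction_id] = amount


def refund_customer(ledger: Dict[str, int], transaction_id: str) -> None:
    # Idempotent: a no-op if the charge never happened or was already refunded.
    ledger.pop(transaction_id, None)


def book_flight(flight_no: str) -> str:
    # Simulated tool failure for the purposes of this sketch.
    raise TimeoutError(f"booking tool timed out for {flight_no}")


def run_booking(ledger: Dict[str, int], amount: int, flight_no: str) -> Tuple[str, List[str]]:
    saga = Saga()
    transaction_id = str(uuid.uuid4())  # captured before any side effect
    try:
        # Register the undo BEFORE the side effect, with fixed parameters.
        saga.register("refund_customer", lambda: refund_customer(ledger, transaction_id))
        charge_customer(ledger, transaction_id, amount)
        book_flight(flight_no)
        return "completed", []
    except Exception:
        return "compensated", saga.compensate()


if __name__ == "__main__":
    ledger: Dict[str, int] = {}
    print(run_booking(ledger, 250, "XY123"), ledger)
```

Because the refund closure holds only the pre-captured transaction_id, no LLM call is involved in the undo path. In a real workflow engine the same structure would live inside durable workflow code, so that the compensation list survives a process crash; this in-memory version does not.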
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:08:47.300531+00:00 — report_created — created