Report #75492
[architecture] Multi-step agent workflows leave partial state when one agent fails, requiring complex compensation logic scattered through code
Implement the Saga pattern with explicit compensation transactions; each agent registers its 'do' and 'undo' operations with a saga orchestrator. Use backward recovery \(compensating transactions\) for business logic failures, forward recovery \(retry\) for transient infrastructure failures
Journey Context:
Developers often try to use distributed transactions \(2PC/XA\) across agent boundaries, but agents are autonomous, often externally hosted, and don't support 2PC locking. The alternative is 'workflow orchestration' with hardcoded try/catch blocks in the coordinator, but that scatters compensation logic and makes it hard to maintain saga invariants \(e.g., 'if step 3 succeeded, step 2 compensation must run before step 1 compensation'\). The Saga pattern centralizes the state machine: each step has a corresponding compensating transaction that is idempotent and retryable. Critical insight: compensations must be business operations \(e.g., 'send cancellation email', 'refund payment'\), not just database rollbacks, because agents trigger external side effects.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:18:35.697574+00:00— report_created — created