Report #38418
[architecture] Inconsistent state when partial failures occur in multi-step agent workflows
Implement the Saga pattern: decompose long-running workflows into compensatable steps, record each step's intent in a durable log (a transaction manager), and, if a downstream agent fails, run compensating transactions for the steps that already completed. This yields eventual consistency without distributed locks.
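The intent log can be backed by any durable store. The sketch below assumes SQLite (Python's built-in `sqlite3` module) stands in for the durable transaction manager. The table `saga_log`, the class `IntentLog`, and the `recover` helper are hypothetical names for illustration. Each step is logged as `started` before its action runs and as `done` after it succeeds, so a restarted coordinator can find every step that may have taken effect and compensate it, newest first.

```python
import sqlite3
from typing import Callable

# Hypothetical schema: one row per saga step. A row is written as
# 'started' before the step's action runs, then updated to 'done'
# or 'compensated'.
SCHEMA = """
CREATE TABLE IF NOT EXISTS saga_log (
    saga_id TEXT NOT NULL,
    step    TEXT NOT NULL,
    seq     INTEGER NOT NULL,
    status  TEXT NOT NULL,
    PRIMARY KEY (saga_id, step)
)
"""


class IntentLog:
    def __init__(self, path: str = ":memory:") -> None:
        # A file path gives durability; ":memory:" is only for demos.
        self.db = sqlite3.connect(path)
        self.db.execute(SCHEMA)
        self.db.commit()

    def begin(self, saga_id: str, step: str, seq: int) -> None:
        # Log intent before calling the agent; the `with` block commits it.
        with self.db:
            self.db.execute(
                "INSERT INTO saga_log VALUES (?, ?, ?, 'started')",
                (saga_id, step, seq),
            )

    def mark(self, saga_id: str, step: str, status: str) -> None:
        with self.db:
            self.db.execute(
                "UPDATE saga_log SET status = ? WHERE saga_id = ? AND step = ?",
                (status, saga_id, step),
            )

    def pending_compensations(self, saga_id: str) -> list[str]:
        # Steps that may have taken effect ('started' or 'done'), newest first.
        rows = self.db.execute(
            "SELECT step FROM saga_log WHERE saga_id = ? "
            "AND status IN ('started', 'done') ORDER BY seq DESC",
            (saga_id,),
        ).fetchall()
        return [row[0] for row in rows]


def recover(log: IntentLog, saga_id: str,
            compensations: dict[str, Callable[[], None]]) -> None:
    # Undo newest-first and record each compensation so that a second
    # recovery pass does not repeat it.
    for step in log.pending_compensations(saga_id):
        compensations[step]()
        log.mark(saga_id, step, "compensated")


log = IntentLog()
log.begin("trip-1", "book_flight", 1)
log.mark("trip-1", "book_flight", "done")
log.begin("trip-1", "book_hotel", 2)  # coordinator crashes before logging the outcome

undone: list[str] = []
recover(log, "trip-1", {
    "book_flight": lambda: undone.append("cancel_flight"),
    "book_hotel": lambda: undone.append("cancel_hotel"),
})
print(undone)  # ['cancel_hotel', 'cancel_flight']
```

Compensating steps that are only `started` assumes each compensation is idempotent and safe to run even if the original action never took effect; the sketch relies on that property but does not enforce it.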
Journey Context:
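When Agent A books a flight and Agent B fails to book a hotel, you need to cancel the flight to avoid inconsistency. Two-phase commit (2PC) is a poor fit here: participants hold their locks from the prepare phase until the coordinator decides, so LLM agents that may take seconds to respond would keep resources blocked for the whole workflow. The alternative, 'hope it doesn't fail', leads to orphaned bookings. The Saga pattern models each step as a local transaction paired with a compensating action (e.g., 'book flight' / 'cancel flight'). If step N fails, the coordinator runs the compensations for steps 1..N-1 in reverse order, starting with step N-1. This gives up isolation, since other parties can observe intermediate states such as a booked flight with no hotel, in exchange for availability and performance. The tradeoff is complexity: you must write compensating logic, and every step must be compensatable, which not all LLM operations are (a message an agent has already sent cannot be unsent). But for multi-agent workflows that cross trust boundaries, where no party can hold locks on another's resources, it is often the only viable consistency model.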
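A minimal in-memory sketch of the coordinator for the flight/hotel case follows. `Step`, `run_saga`, and `SagaFailed` are hypothetical names, the agent calls are stubbed as local functions, and nothing is persisted, so a crash mid-saga would lose track of completed steps. It shows the core rule: when step N fails, the compensations for steps 1..N-1 run in reverse order.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # e.g. ask Agent A to book a flight
    compensate: Callable[[], None]  # e.g. ask Agent A to cancel it


class SagaFailed(Exception):
    pass


def run_saga(steps: list[Step]) -> None:
    completed: list[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception as exc:
            # Undo the already-completed steps, most recent first.
            # Sketch only: a failing compensation here aborts the rollback.
            for done in reversed(completed):
                done.compensate()
            raise SagaFailed(
                f"{step.name} failed; compensated {len(completed)} step(s)"
            ) from exc
        completed.append(step)


# Usage with stubbed agents: the hotel step fails, so the flight is cancelled.
bookings: list[str] = []

def book_flight() -> None:
    bookings.append("flight")

def cancel_flight() -> None:
    bookings.remove("flight")

def book_hotel() -> None:
    raise RuntimeError("Agent B: no rooms available")

def cancel_hotel() -> None:
    bookings.remove("hotel")

try:
    run_saga([
        Step("book_flight", book_flight, cancel_flight),
        Step("book_hotel", book_hotel, cancel_hotel),
    ])
except SagaFailed as err:
    print(err)   # book_hotel failed; compensated 1 step(s)
print(bookings)  # []
```

Unwinding in reverse mirrors popping a stack: later steps may depend on earlier ones, so they are undone first. A production coordinator would also log each step durably and retry failed compensations instead of letting one exception stop the rollback.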
Lifecycle
2026-06-18T18:57:54.758464+00:00— report_created — created