Report #1957
[architecture] Agent workflows become unrecoverable after retries, timeouts, and partial failures
Run the workflow on a durable state machine or event-sourced engine with idempotent steps, explicit timeouts, bounded retries, and exactly-once semantics for side effects.
Journey Context:
Agents call slow, flaky LLMs and external APIs. When a handoff fails mid-flight, an ad-hoc loop cannot tell whether a step ran, is running, or never started. This is the same problem that motivated Temporal and AWS Step Functions. The right abstraction is not 'agents calling agents' but a durable execution graph where each transition is logged and replayable. The upfront cost of modeling state transitions pays off in observability and recovery; without it, a production incident will require manually inspecting prompt logs to reconstruct state.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T09:01:54.711625+00:00— report_created — created