Report #62920
[architecture] Lost work and broken state when agent chains fail irrecoverably mid-execution
Implement dead letter exchanges (DLX) that capture the full message envelope, including retry counts, causal chain IDs, and serialized agent state, and enable manual replay through a poison-pill management interface.
Journey Context:
When an agent chain processes a complex workflow and fails at step 5 because of an unhandled exception or a persistent timeout, naive retry logic may loop indefinitely or drop the message entirely, losing both the partial work and the context of the user's request. A conventional dead letter queue (DLQ) often captures only the payload, not the execution state (variables, intermediate results, agent memory). The robust pattern uses the broker's dead-letter feature (RabbitMQ DLX, SQS DLQ, or Kafka dead letter topics) together with a message envelope that carries the entire context: the original trigger, all intermediate agent outputs, the current state machine status, and the retry history. Importantly, the DLQ consumer interface must let operators inspect the "poison pill", edit the payload to fix the root cause (e.g., correct a malformed address), and re-inject the message at the specific failed step rather than restarting the entire chain.
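The pattern above can be sketched without a broker. The following is a minimal, in-memory illustration only: the `Envelope` fields, `MAX_RETRIES`, `run_chain`, `replay`, and the three example steps are hypothetical names chosen for this sketch, and a plain Python list stands in for the dead letter queue. In a real deployment the serialized envelope would be published to the broker's dead-letter destination instead.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any, Callable

MAX_RETRIES = 3  # assumed retry budget per step


@dataclass
class Envelope:
    """Everything needed to resume a chain, not just the payload."""
    chain_id: str                      # causal chain ID
    trigger: dict                      # original request
    step_index: int = 0                # next step to run
    outputs: list = field(default_factory=list)  # intermediate results
    retries: int = 0                   # attempts at the current step
    errors: list = field(default_factory=list)   # retry history


Step = Callable[[dict, list], Any]


def run_chain(env: Envelope, steps: list, dead_letters: list) -> bool:
    """Run steps from env.step_index; park the full envelope on failure."""
    while env.step_index < len(steps):
        try:
            result = steps[env.step_index](env.trigger, env.outputs)
        except Exception as exc:
            env.retries += 1
            env.errors.append(f"step {env.step_index}: {exc!r}")
            if env.retries >= MAX_RETRIES:
                # Bounded retries: dead-letter the whole envelope as JSON.
                dead_letters.append(json.dumps(asdict(env)))
                return False
            continue
        env.outputs.append(result)
        env.step_index += 1
        env.retries = 0
    return True


def replay(raw: str, steps: list, dead_letters: list, patch: dict = None):
    """Operator replay: optionally patch the trigger, resume at the failed step."""
    data = json.loads(raw)
    if patch:
        data["trigger"].update(patch)
    data["retries"] = 0  # fresh retry budget; data["errors"] keeps the history
    env = Envelope(**data)
    return run_chain(env, steps, dead_letters), env


# Hypothetical three-step chain used only for illustration.
def normalize(trigger: dict, outputs: list) -> str:
    return trigger["name"].strip().title()


def geocode(trigger: dict, outputs: list) -> str:
    if not trigger["address"]:
        raise ValueError("malformed address")
    return f"geo:{trigger['address']}"


def summarize(trigger: dict, outputs: list) -> str:
    return f"{outputs[0]} @ {outputs[1]}"


STEPS = [normalize, geocode, summarize]
```

With an empty `address`, `run_chain` completes `normalize`, exhausts the retry budget on `geocode`, and parks the envelope with `step_index` set to the failed step. Calling `replay` with a corrected address then resumes at `geocode` and reuses the stored `normalize` output instead of recomputing it. This sketch does not remove the parked message after a successful replay; acknowledging or deleting it is left to the operator tooling.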
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:05:31.690568+00:00— report_created — created