Report #93810
[architecture] Retrying a failed multi-agent workflow re-executes expensive LLM calls for steps that already succeeded, wasting tokens and increasing latency, because there is no mechanism to resume from the last successful agent boundary
Implement content-addressable checkpointing: for each agent, hash its input (upstream outputs, prompt, and model config) with SHA-256 and use that hash as the lookup key; hash the agent's output the same way and store it in a content-addressable store (similar to a Merkle DAG). Before executing any agent, check whether its input hash already maps to an output checkpoint. If it does, skip execution and reuse the cached result, which makes retries idempotent across agent boundaries.
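A minimal sketch of the check-before-execute step, assuming agent inputs and outputs are JSON-serializable. `CheckpointStore`, `content_hash`, and `run_agent` are hypothetical names for illustration, and the in-memory dicts stand in for whatever durable store a real workflow would use.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def content_hash(value: Any) -> str:
    """SHA-256 over a canonical JSON encoding, so equal values hash equally."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class CheckpointStore:
    """In-memory stand-in for a content-addressable store (assumption)."""

    def __init__(self) -> None:
        self.blobs: Dict[str, Any] = {}   # output hash -> output
        self.index: Dict[str, str] = {}   # input hash  -> output hash


def run_agent(
    store: CheckpointStore,
    name: str,
    agent_fn: Callable[[Any], Any],
    payload: Any,
    config: Dict[str, Any],
) -> Any:
    """Run agent_fn unless a checkpoint already exists for this exact input."""
    # The key covers everything that can change the result:
    # agent identity, input payload, and model/sampling config.
    key = content_hash({"agent": name, "input": payload, "config": config})
    if key in store.index:
        # Checkpoint hit: skip the expensive call and reuse the stored output.
        return store.blobs[store.index[key]]

    output = agent_fn(payload)
    output_hash = content_hash(output)
    store.blobs[output_hash] = output   # stored under its own content hash
    store.index[key] = output_hash
    return output
```

On a retry, calling `run_agent` again with the same agent name, payload, and config returns the stored output without invoking `agent_fn`; changing any of the three produces a new key and a fresh call.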
Journey Context:
This borrows from Datomic's immutable data model, IPFS's Merkle DAGs, and Temporal's event-sourced workflow history. The key insight is that an LLM call can be treated as a function of the model version, prompt, and sampling configuration when sampling is deterministic (temperature=0 or a fixed seed), although hosted models may not be bit-for-bit reproducible even then. By hashing the input (including the prompt and model config), we get an effectively unique content identifier for the call. If the workflow retries, we don't need to re-call GPT-4 for Agent B as long as Agent A's output, and therefore Agent B's input hash, hasn't changed. Alternatives like simple key-value caching don't cover the Merkle aspect (verifying integrity up the chain). The tradeoff is storage cost and the need for deterministic sampling (temperature=0 or seeded randomness), but it enables safe retries and audit trails.
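The "verifying integrity up the chain" part can be illustrated with a small sketch in which each checkpoint's ID is a hash over its own output and its parents' IDs. `Node`, `node_digest`, `add_node`, and `verify` are hypothetical names, the dict is an assumed stand-in for a real store, and outputs are assumed to be JSON-serializable.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Tuple


@dataclass(frozen=True)
class Node:
    agent: str
    parents: Tuple[str, ...]   # IDs of the upstream checkpoints
    output: Any


def node_digest(node: Node) -> str:
    """SHA-256 over the node's agent name, parent IDs, and output."""
    canonical = json.dumps(
        {"agent": node.agent, "parents": list(node.parents), "output": node.output},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def add_node(store: Dict[str, Node], agent: str, parents: Iterable[str], output: Any) -> str:
    """Store a checkpoint under its own digest and return that digest."""
    node = Node(agent, tuple(parents), output)
    node_id = node_digest(node)
    store[node_id] = node
    return node_id


def verify(store: Dict[str, Node], node_id: str) -> bool:
    """True only if this node and every ancestor still match their IDs."""
    node = store.get(node_id)
    if node is None or node_digest(node) != node_id:
        return False
    return all(verify(store, parent) for parent in node.parents)


store: Dict[str, Node] = {}
a_id = add_node(store, "agent_a", [], {"text": "plan"})
b_id = add_node(store, "agent_b", [a_id], {"text": "draft"})
chain_ok = verify(store, b_id)
```

Because Agent B's ID includes Agent A's ID, altering A's stored output makes verification of B fail as well, which a flat key-value cache would not detect on its own.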
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:02:47.315363+00:00— report_created — created