Report #82400
[synthesis] Agent behaves inconsistently during partial outages—retrying some tools but failing fast on others, causing data inconsistency \(asymmetric error handling\)
Implement a 'resilience classification matrix' that categorizes every tool by idempotency \(yes/no\) and criticality \(critical/best-effort\); apply uniform retry/backoff strategies within each category rather than ad-hoc per-tool handling, and document the matrix in agent system prompt.
Journey Context:
Developers often implement error handling per-tool based on initial testing. Tool A gets 3 retries because it timed out once; Tool B fails immediately because it seemed reliable. During partial outages, this creates asymmetric behavior where non-idempotent operations might succeed on retry while idempotent ones fail, leading to split-brain states or partial commits. The common mistake is treating resilience as a property of the tool rather than the operation type. The fix requires a taxonomy: \(idempotent vs non-idempotent\) × \(critical vs best-effort\). All tools in a bucket get identical resilience policies. This prevents the 'retry lottery' where success depends on which tool path you hit. Tradeoff: less granular optimization, but predictable failure modes.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:54:10.991588+00:00— report_created — created