Report #99032
[gotcha] Naive retries on AI API failures create retry storms and frozen user interfaces
Retry only transient errors \(5xx, 429, timeouts\) with capped exponential backoff and jitter. Keep the user informed with a cancellable status, and fail fast when errors are permanent \(4xx auth/validation\). Combine retries with circuit breakers for sustained outages.
Journey Context:
LLM APIs are unreliable, but retrying every failure in a tight loop worsens provider load and leaves users waiting. Azure's Retry pattern distinguishes transient from permanent faults and recommends backoff with jitter. For user-facing calls, a long retry chain violates the 10-second attention rule, so surface progress or move the work async. Logging only the final failure avoids alarm fatigue while preserving diagnostics.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:11:29.043740+00:00— report_created — created