Report #29574
[cost_intel] Aggressive client-side timeouts trigger retries before the server completes, so both attempts bill tokens while only one result is returned
Set the client timeout above the 95th-percentile latency (e.g., 120s for long contexts); implement idempotency keys to detect duplicate requests; use request cancellation instead of timeout abandonment where possible
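A minimal client-side sketch of the first two fixes, assuming you log per-request latencies yourself. `p95`, `choose_timeout` and `call_with_retries` are hypothetical helpers, not part of any SDK. The 1.5x headroom is an arbitrary assumption, and `send` stands in for whatever function performs the HTTP request and attaches the key in the header your provider documents. The important detail is that the idempotency key is generated once and reused on every retry. A fresh key per attempt would make each retry look like a new request.

```python
import math
import uuid
from typing import Callable, Optional, Sequence


def p95(latencies_s: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of observed request latencies (seconds)."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


def choose_timeout(latencies_s: Sequence[float], headroom: float = 1.5,
                   floor_s: float = 0.0) -> float:
    """Client timeout set above p95 latency; headroom and floor are assumptions."""
    return max(p95(latencies_s) * headroom, floor_s)


def call_with_retries(send: Callable[[str], str], attempts: int = 3) -> str:
    """Retry `send` on TimeoutError, reusing ONE idempotency key for all attempts.

    `send` is a hypothetical callable that performs the request and attaches
    `key` in whatever idempotency/request-ID header the provider accepts.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    key = str(uuid.uuid4())  # generated once, so every retry carries the same key
    last_exc: Optional[TimeoutError] = None
    for _ in range(attempts):
        try:
            return send(key)
        except TimeoutError as exc:
            # The abandoned attempt may still be running (and billing) server-side.
            last_exc = exc
    raise TimeoutError(f"all {attempts} attempts timed out") from last_exc
```

For long-context requests, something like `choose_timeout(observed, floor_s=120.0)` mirrors the 120s example above. Reusing the key only prevents double billing if something on the server side deduplicates on it. Otherwise it only lets you detect duplicates in your own logs.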
Journey Context:
When a request takes longer than expected (e.g., a large 128k-token context), the client often times out and retries. The server, however, keeps processing the first request and bills for its tokens, and the retry is billed as well. The user receives one response (from the retry) but pays twice. This is common in serverless environments with 30s timeouts. There are three fixes. First, align client timeouts with realistic latency (a 128k-context request to OpenAI can take 30-60s). Second, send an idempotency key (like `x-request-id`) so duplicates can be detected and, where the provider supports it, deduplicated. Third, use streaming, which keeps the connection active and shows early that the server is making progress, avoiding premature timeouts.
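The deduplication described above has to happen on the server side. Below is a minimal asyncio sketch of the idea, assuming a hypothetical gateway you control that sits in front of the model. `IdempotentGateway`, `fake_model` and the key strings are illustrative and do not represent any provider's API. A retry that arrives while the first attempt is still running awaits the same in-flight task instead of starting a second generation, so the tokens are produced and billed once.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


class IdempotentGateway:
    """Hypothetical proxy that runs each idempotency key at most once."""

    def __init__(self, run_model: Callable[[str], Awaitable[str]]) -> None:
        self._run_model = run_model
        # key -> in-flight or finished task; never expired in this sketch
        self._tasks: Dict[str, "asyncio.Future[str]"] = {}

    async def handle(self, key: str, prompt: str) -> str:
        task = self._tasks.get(key)
        if task is None:
            # First time we see this key: start (and bill) exactly one generation.
            task = asyncio.ensure_future(self._run_model(prompt))
            self._tasks[key] = task
        # shield(): if this caller times out or disconnects, the shared
        # generation keeps running so a retry with the same key can reuse it.
        # A failed task stays cached, so its exception is re-raised on retry.
        return await asyncio.shield(task)


async def demo() -> Tuple[str, int]:
    billed: List[str] = []

    async def fake_model(prompt: str) -> str:
        billed.append(prompt)       # one entry per billed generation
        await asyncio.sleep(0.2)    # slower than the client timeout below
        return f"answer to {prompt!r}"

    gw = IdempotentGateway(fake_model)
    try:
        # Aggressive client timeout: gives up long before the model finishes.
        await asyncio.wait_for(gw.handle("key-1", "long prompt"), timeout=0.05)
    except asyncio.TimeoutError:
        pass  # client abandons the attempt; the shielded task keeps running
    # Retry with the SAME key attaches to the in-flight task.
    result = await gw.handle("key-1", "long prompt")
    return result, len(billed)
```

`asyncio.run(demo())` returns `("answer to 'long prompt'", 1)`. Without the key lookup, the retry would call `fake_model` a second time and the billed count would be 2. A production store would also need a TTL for entries and a policy for whether a failed attempt may be retried under the same key.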
⚠ Workarounds are unverified; always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:01:51.440795+00:00— report_created — created