Report #88731
[cost_intel] Streaming over SSE reduces perceived latency but often incurs 15-25% higher total token costs, due to prompt replay on connection drops and the lack of response caching at CDN edges
Use blocking (non-streaming) requests for tasks with an expected latency under 2s, and implement client-side connection pooling with HTTP/2 multiplexing. Reserve streaming for genuinely interactive use cases where generation takes longer than 5s.
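A minimal sketch of the routing rule above, with a small response cache for the blocking path. The 2s and 5s thresholds come from the recommendation; everything else is an assumption made for illustration: the `expected_latency_s` estimate, the `interactive` flag, the `call_model` callable standing in for a real blocking API call, and the in-memory cache. The recommendation does not cover the 2-5s range, so this sketch defaults it to blocking.

```python
import hashlib
from typing import Callable, Dict

BLOCKING_MAX_S = 2.0   # from the report: blocking requests under 2s expected latency
STREAMING_MIN_S = 5.0  # from the report: streaming only above 5s generation time


def choose_mode(expected_latency_s: float, interactive: bool) -> str:
    """Return 'stream' or 'blocking' for one request.

    Assumption: anything that is not both interactive and slower than
    STREAMING_MIN_S (including the uncovered 2-5s range) goes to blocking.
    """
    if interactive and expected_latency_s > STREAMING_MIN_S:
        return "stream"
    return "blocking"


class CachedBlockingClient:
    """Wraps a hypothetical blocking call with an in-memory response cache."""

    def __init__(self, call_model: Callable[[str], str]) -> None:
        self._call_model = call_model      # stand-in for the real API call
        self._cache: Dict[str, str] = {}
        self.upstream_calls = 0            # counts requests that were not cache hits

    def complete(self, prompt: str) -> str:
        # Identical prompts map to the same key, so repeats are served locally.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.upstream_calls += 1
            self._cache[key] = self._call_model(prompt)
        return self._cache[key]
```

Caching by prompt hash only makes sense for queries where an identical response is acceptable, which is the "cacheable queries" case the report refers to.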
Journey Context:
OpenAI's streaming API sends tokens as Server-Sent Events. While this improves perceived latency, production monitoring shows higher costs: (1) Mobile clients drop connections mid-stream and retry, replaying the full prompt context (which might be 4k-8k tokens) for a retry that generates only 200 tokens. (2) Streaming responses bypass HTTP caching layers that could otherwise serve identical responses for cacheable queries. (3) Optimizing for time to first token encourages longer overall generations (users wait for a 'good enough' answer rather than the generation stopping at an optimal point). Blocking (non-streaming) requests complete faster for short tasks (<2s) and allow response caching. The 15-25% cost delta comes from retry amplification, not the streaming mechanism itself. The signature is a high 'retry' count in logs: repeated calls with identical prompt tokens but varying completion tokens.
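The log signature described above can be checked with a short script. This is a sketch under assumptions: the `CallRecord` fields (including `prompt_hash`) are a hypothetical log schema, not a format the report specifies, and the overhead is measured in tokens rather than dollars because input and output tokens are usually priced differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class CallRecord:
    prompt_hash: str        # hypothetical field: hash of the full prompt
    prompt_tokens: int
    completion_tokens: int


def find_retry_groups(records: Iterable[CallRecord]) -> Dict[str, List[CallRecord]]:
    """Group calls by prompt and keep groups matching the retry signature:
    more than one call with the same prompt but differing completion lengths."""
    groups: Dict[str, List[CallRecord]] = defaultdict(list)
    for rec in records:
        groups[rec.prompt_hash].append(rec)
    return {
        key: calls
        for key, calls in groups.items()
        if len(calls) > 1 and len({c.completion_tokens for c in calls}) > 1
    }


def replay_overhead(records: Iterable[CallRecord]) -> float:
    """Fraction of all logged tokens that are prompt tokens replayed by retries.

    The first call in each retry group is treated as the original request;
    the prompt tokens of every later call in the group count as replayed.
    """
    records = list(records)
    total = sum(r.prompt_tokens + r.completion_tokens for r in records)
    if total == 0:
        return 0.0
    replayed = sum(
        sum(c.prompt_tokens for c in calls[1:])
        for calls in find_retry_groups(records).values()
    )
    return replayed / total


# Illustrative data only: one 6k-token prompt retried once, one unrelated call.
sample = [
    CallRecord("a", 6000, 120),
    CallRecord("a", 6000, 200),
    CallRecord("b", 4000, 200),
]
print(sorted(find_retry_groups(sample)), round(replay_overhead(sample), 3))
```

Identical prompts with identical completion lengths are deliberately excluded, since those look more like legitimate repeated queries (candidates for caching) than dropped-connection retries.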
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:31:19.591776+00:00— report_created — created