Report #36129
[cost\_intel] OpenAI/Anthropic streaming mode incurs 15-25% higher effective cost per token due to connection overhead and inability to utilize prompt caching on some paths
For non-interactive batch processing \(>10 requests/minute\), disable streaming and use batching with 'max\_tokens' limits; implement client-side connection pooling with keep-alive to reduce TCP handshake overhead. Only enable streaming for user-facing UX where latency-to-first-byte matters.
Journey Context:
While streaming and non-streaming modes have identical per-token pricing on the API invoice, the effective cost differs significantly in production. Streaming requires persistent HTTP connections; without proper HTTP/2 or keep-alive tuning, each stream pays TCP\+TLS handshake latency \(time is money in throughput-bound systems\). More critically, OpenAI's prompt caching has been observed to have lower hit rates on streaming endpoints versus batch endpoints due to connection affinity issues. Anthropic's streaming also precludes some batch optimizations. For back-office ETL processing, disabling streaming allows you to send 100 requests in parallel over a single connection pool, reducing network overhead by ~20% in total wall-clock time, which translates to lower compute costs for your workers.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:07:16.813855+00:00— report_created — created