Report #41240
[cost_intel] Streaming prevents prompt caching and incurs hidden per-request overhead
Disable streaming for cacheable, high-volume requests; use the Batch API for a 50% cost reduction on workloads that can tolerate a 24-hour turnaround; stream only where real-time UX requires it.
Journey Context:
Streaming (stream=true) is essential for real-time UX but carries hidden costs. First, streamed requests may miss the prompt cache. The suspected cause is that the cache is keyed on a hash of the request, and streaming connections can differ in headers or connection-pooling behavior in ways that change the key; this mechanism is unconfirmed, so compare the cached-token counts reported in usage data for streamed and non-streamed calls before relying on it. Second, a streamed response has to be consumed sequentially over an open connection, which ties up the client for the duration of the response and makes it harder to submit requests in bulk. Third, OpenAI's Batch API is priced 50% below standard but is incompatible with streaming and returns results within a 24-hour completion window rather than immediately. The trap is a 'streaming everywhere' architecture, which rules out both batch pricing and caching. Rough magnitudes: the Batch API is 50% cheaper than standard pricing, and cached prompt tokens can be up to 90% cheaper depending on provider and model; a streaming-only design forgoes both. The fix is architectural segregation: backend processing uses batch or non-streaming requests with caching, and only user-facing chat uses streaming.
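The segregation described above can be sketched as a small routing and cost-estimation helper. This is an illustrative sketch only: `Workload`, `choose_mode`, and `estimate_input_cost` are hypothetical names, the discount constants are the figures quoted in this report rather than any provider's current price list, and the model assumes that batch and cache discounts do not stack and that streamed requests get no cache benefit (the report's unverified claim). It makes no API calls.

```python
from dataclasses import dataclass

# Figures quoted in this report; actual rates vary by provider and model.
BATCH_DISCOUNT = 0.50  # Batch API: 50% below standard pricing
CACHE_DISCOUNT = 0.90  # cached input tokens: up to 90% below standard


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a request class."""
    user_facing: bool      # is a person watching tokens arrive?
    deadline_hours: float  # how long the caller can wait for a result


def choose_mode(w: Workload) -> str:
    """Return 'stream', 'batch', or 'sync' following the report's rule."""
    if w.user_facing:
        return "stream"   # real-time UX is the only reason to stream
    if w.deadline_hours >= 24:
        return "batch"    # tolerant of the 24-hour completion window
    return "sync"         # non-streaming request, eligible for caching


def estimate_input_cost(tokens: int, price_per_mtok: float, mode: str,
                        cached_fraction: float = 0.0) -> float:
    """Estimate input-token cost in the same currency as price_per_mtok."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    base = tokens * price_per_mtok / 1_000_000
    if mode == "batch":
        # Assumption: the batch discount is applied alone, without caching.
        return base * (1 - BATCH_DISCOUNT)
    if mode == "sync":
        # Only the cached share of the prompt gets the cache discount.
        return base * (1 - cached_fraction * CACHE_DISCOUNT)
    if mode == "stream":
        # Assumption from the report: streamed requests miss the cache.
        return base
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    nightly = Workload(user_facing=False, deadline_hours=48)
    mode = choose_mode(nightly)  # "batch"
    # 1M input tokens at a hypothetical price of 3.00 per million tokens
    print(mode, estimate_input_cost(1_000_000, 3.00, mode))
```

Keeping the routing decision in one function makes the streaming default explicit rather than implicit, so a backend job cannot end up streaming just because the shared client was configured for chat.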
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:41:38.672708+00:00— report_created — created