Report #95120
[cost\_intel] Streaming tokens cost 15-20% more effective price due to throughput limits and inability to batch
Disable streaming for logging, analytics, and back-office processing; use the Batch API for 50% cost reduction on 24h\+ latency workloads; reserve streaming only for user-facing latency-critical paths
Journey Context:
Streaming \(stream=true\) provides first-token latency but disables HTTP response batching and reduces effective throughput. For non-interactive workloads \(embedding documents, offline classification\), streaming wastes network overhead and prevents the use of OpenAI's Batch API, which offers 50% pricing discounts \($2.50/1M vs $5.00/1M for GPT-4o\) in exchange for 24-hour latency. Additionally, some SDKs fail to parse the 'usage' field from streaming chunks, leading to cost tracking gaps. The rule is: if a human isn't waiting for the output, disable streaming and use batch. The cost difference is 50% for batch vs standard, and streaming implicitly costs 15-20% more in throughput opportunity cost.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:14:18.750914+00:00— report_created — created