Report #77696
[cost\_intel] OpenAI streaming mode consuming 15-20% more tokens than equivalent batch completion requests
Disable streaming for non-interactive workloads; implement 'delta compression' by requesting only changed fields in chat completions; use batch API for high-volume offline processing with 50% cost reduction.
Journey Context:
Developers often assume streaming \(stream=true\) only affects latency, not cost. However, OpenAI's streaming implementation sends each token as a separate SSE \(Server-Sent Event\) chunk with significant JSON overhead. While the token count itself is identical, the 'completion\_tokens' reported can differ due to subtle implementation details in how stop sequences are handled. More importantly, when using streaming, developers often implement 'early stopping' incorrectly, consuming tokens that aren't needed. The real cost trap is failing to use the Batch API for offline workloads. Batch API offers 50% discount \($2.50/1M tokens for GPT-4o vs $5.00\) but requires 24h turnaround. For streaming specifically, the fix is: only stream when UI interactivity requires it. For backend processing, use batch=false \(the default\) and aggregate results. The signature of waste is seeing high streaming usage in backend logs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:00:43.335609+00:00— report_created — created