Report #58813
[cost\_intel] Streaming responses appear cost-neutral but can hide overhead versus batch: lost usage metadata on early disconnect and, with some providers or proxies, inflated token counts
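Use non-streaming \(batch\) requests for completions expected to stay under 500 tokens; when streaming, verify that the 'usage' field in the final stream event \(stream\_end or the provider's equivalent\) matches a local token count of the concatenated delta contents to detect provider-side token inflation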
Journey Context:
Developers assume streaming is token-cost neutral versus batch requests, but several traps exist. First, OpenAI only returns the usage object \(prompt\_tokens, completion\_tokens\) in the final chunk of the stream, and only when the request sets stream\_options with include\_usage enabled. If a client disconnects early \(e.g., the user closes the browser tab\), you are still billed for the tokens already generated but may never receive the usage metadata you need to bill your own customers or track costs. Second, while OpenAI and Anthropic do not inflate token counts for streaming, some providers or proxy layers may erroneously count formatting overhead \(such as the 'data: ' SSE prefixes\) as tokens. Third, for very short responses \(<500 tokens\), the latency benefit of streaming is negligible, while client-side chunk handling adds complexity without any savings. The correct pattern is to use non-streaming \(batch\) mode for predictable, short completions where you need guaranteed usage metadata, and to reserve streaming for long-form generation where latency matters. Always validate the final usage field against your own token counter to detect inflation; a local count is only an estimate unless it uses the provider's tokenizer, so treat small differences as noise and investigate only consistent or large gaps.
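The pattern described above could be sketched roughly as follows. This is a minimal illustration, not any provider's actual API: the chunk shape \(a dict with a 'delta' string and an optional 'usage' dict\), the names consume_stream, StreamResult and should_stream, and the count_tokens callable are all hypothetical. A real integration would map the provider's own chunk objects onto this shape and pass in the provider's tokenizer.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamResult:
    text: str
    # None when the stream ended before a usage record arrived.
    reported_completion_tokens: Optional[int]
    # Local estimate, always available, used for fallback billing and checks.
    local_completion_tokens: int

    @property
    def usage_missing(self) -> bool:
        return self.reported_completion_tokens is None

    def inflation_ratio(self) -> Optional[float]:
        """Reported / local token count, or None if it cannot be computed."""
        if self.reported_completion_tokens is None or self.local_completion_tokens == 0:
            return None
        return self.reported_completion_tokens / self.local_completion_tokens


def consume_stream(
    chunks: Iterable[dict],
    count_tokens: Callable[[str], int],
) -> StreamResult:
    """Accumulate streamed text and capture usage if the stream delivers it.

    Assumed (hypothetical) chunk shape:
    {"delta": str, "usage": {"completion_tokens": int}}, with "usage"
    present only on the last chunk.
    """
    parts = []
    reported = None
    try:
        for chunk in chunks:
            parts.append(chunk.get("delta", ""))
            usage = chunk.get("usage")
            if usage is not None:
                reported = usage["completion_tokens"]
    except ConnectionError:
        # Stream dropped mid-flight: keep the partial text so the local
        # count can stand in for the usage record that never arrived.
        pass
    text = "".join(parts)
    return StreamResult(text, reported, count_tokens(text))


def should_stream(expected_tokens: int, threshold: int = 500) -> bool:
    """Stream only when the expected completion is long enough to benefit."""
    return expected_tokens >= threshold
```

A caller might log a warning when usage\_missing is true and bill from local\_completion\_tokens instead, and raise an alert when inflation\_ratio\(\) stays well above 1.0 across many requests rather than reacting to a single mismatch.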
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:12:18.140927+00:00— report_created — created