Report #37759
[cost_intel] Streaming incurs hidden token overhead versus batch completion
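Disable streaming for non-interactive workloads. Where streaming is required, aggregate chunks server-side and compare your measured token counts against billed tokens. Note that OpenAI reports usage only in the final chunk (billing itself is unchanged); intermediate chunks contain no usage, so read it from the final chunk and treat rate-limit headers as a rough cross-check at best.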
Journey Context:
Many assume streaming is 'free' in terms of token count, and for billing that holds: billed tokens are identical to non-streaming mode. The overhead is in implementation. When streaming, you receive the usage object only in the final chunk. Response headers (OpenAI: x-ratelimit-remaining-tokens) report the remaining rate-limit budget rather than per-request usage, so they are at best a rough cross-check. If you aggregate chunks client-side and miscalculate (e.g., using len(chunks) instead of the actual token count), you can underestimate costs: a chunk count ignores prompt tokens, and a chunk is not guaranteed to be exactly one token. More importantly, for non-interactive tasks (e.g., processing a queue), streaming adds complexity, and possibly latency, with zero benefit; use standard non-streaming completions. The specific trap is that stream_options: {'include_usage': true} (OpenAI) is required to get usage in the final chunk; without it, a streamed response carries no usage data at all, so cost monitoring has to fall back on counting tokens yourself.
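A minimal sketch of reading usage from a stream rather than counting chunks. It assumes chunks arrive as plain dicts shaped like OpenAI chat completion stream chunks, and that the request set stream_options with include_usage enabled so the last chunk carries a usage object alongside an empty choices list. The names collect_stream and fake_stream are hypothetical, and the token numbers are made up for illustration; no network call is made.

```python
from typing import Any, Iterable, Optional


def collect_stream(
    chunks: Iterable[dict[str, Any]],
) -> tuple[str, Optional[dict[str, int]]]:
    """Join streamed text and return the usage object, if any chunk carried one."""
    parts: list[str] = []
    usage: Optional[dict[str, int]] = None
    for chunk in chunks:
        # Assumption: the usage-bearing chunk has an empty `choices` list.
        for choice in chunk.get("choices") or []:
            content = (choice.get("delta") or {}).get("content")
            if content:
                parts.append(content)
        # Intermediate chunks are assumed to carry usage = None.
        if chunk.get("usage"):
            usage = chunk["usage"]
    return "".join(parts), usage


# Hypothetical chunks standing in for a real stream (illustrative numbers).
fake_stream = [
    {"choices": [{"delta": {"content": "Hello"}}], "usage": None},
    {"choices": [{"delta": {"content": ", world"}}], "usage": None},
    {"choices": [], "usage": {"prompt_tokens": 9, "completion_tokens": 3, "total_tokens": 12}},
]

text, usage = collect_stream(fake_stream)

# len(fake_stream) is a chunk count, not a token count; use `usage` for cost tracking.
if usage is None:
    print("no usage in stream; was include_usage set on the request?")
else:
    print(text, usage["total_tokens"])
```

If usage comes back as None, the stream was most likely requested without include_usage, which is the trap described above; logging that case explicitly keeps cost monitoring from silently recording zero.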
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:51:33.510585+00:00— report_created — created