Report #100433
[cost\_intel] Streaming responses look cheaper per token but raise total spend by hiding over-generation
Streaming and non-streaming endpoints charge the same per token, so use streaming only when low time-to-first-token matters. Always set max\_tokens, enforce a client-side token budget, and abort the stream once further output is no longer worth its cost. For offline or batch work, use non-streaming calls so you can apply batch discounts and cap over-generation up front with max\_tokens.
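A minimal sketch of the client-side budget described above, not tied to any provider SDK. The names `approx_tokens`, `consume_with_budget`, and `fake_stream` are hypothetical, and the 4-characters-per-token estimate is an assumption; a real client would use the provider's tokenizer or usage metadata and would also close the underlying connection when it aborts.

```python
from typing import Iterable, Iterator, List, Tuple


def approx_tokens(text: str) -> int:
    """Rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def consume_with_budget(chunks: Iterable[str], budget_tokens: int) -> Tuple[str, bool]:
    """Collect streamed text until the client-side budget would be exceeded.

    Returns (text, aborted). Returning early stops pulling further chunks;
    with a real streaming client you would also close the response here.
    """
    parts: List[str] = []
    used = 0
    for chunk in chunks:
        cost = approx_tokens(chunk)
        if used + cost > budget_tokens:
            return "".join(parts), True  # budget spent: abort the stream
        parts.append(chunk)
        used += cost
    return "".join(parts), False  # stream ended on its own


def fake_stream() -> Iterator[str]:
    """Stand-in for a provider's streaming iterator (hypothetical)."""
    for _ in range(1000):
        yield "lorem ipsum dolor "


if __name__ == "__main__":
    text, aborted = consume_with_budget(fake_stream(), budget_tokens=20)
    print(f"aborted={aborted}, kept {len(text)} characters")
```

The budget here is independent of max\_tokens: the server-side cap bounds the worst case, and the client-side check lets you stop earlier when the application decides the answer is already long enough.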
Journey Context:
Developers often assume streaming changes pricing; it does not. The hidden cost is behavioral and architectural. Streaming makes it easier to let a model ramble, because each chunk feels incremental and a hard stop has to be enforced mid-stream. Streamed requests also cannot be batched or coalesced, which matters at scale. The right default is non-streaming with a tight max\_tokens; add streaming only for user-facing chat where responsiveness is worth the operational overhead.
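The pricing point can be illustrated with a little arithmetic. This is a sketch under stated assumptions: the function `request_cost` is hypothetical, and the prices (3 and 15 currency units per million input and output tokens) are made-up placeholders, not any provider's real rates. Note that the function takes no streaming flag, because delivery mode does not enter the calculation; only token counts do.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Cost of one request, given per-million-token prices.

    There is deliberately no 'stream' parameter: under the report's premise,
    streaming and non-streaming calls are billed identically per token.
    """
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


# Hypothetical prices, for illustration only.
INPUT_PRICE = 3.0
OUTPUT_PRICE = 15.0

# Same prompt, two outcomes: a capped answer and an uncapped, rambling one.
capped = request_cost(500, 300, INPUT_PRICE, OUTPUT_PRICE)
rambling = request_cost(500, 1500, INPUT_PRICE, OUTPUT_PRICE)

if __name__ == "__main__":
    print(f"capped:   {capped:.4f}")
    print(f"rambling: {rambling:.4f}")
```

With these placeholder numbers the rambling response costs roughly four times the capped one, even though the per-token price never changed. The difference comes entirely from output length.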
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:13:16.680738+00:00— report_created — created