Report #51487
[cost_intel] Streaming responses appear cheaper but increase total token consumption through generation overhead
Disable streaming for requests whose expected output is under 100 tokens; route offline processing through a batch API for a 50% cost reduction; monitor time-to-first-token against total generation time to detect overhead
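The monitoring step could be sketched as below. This is a minimal illustration, not a provider-specific implementation: `StreamTiming` and `time_stream` are hypothetical names, and the stream is assumed to be any Python iterable that yields chunks as they arrive. A `ttft_share` close to 1 suggests most of the wait was for the first chunk, so streaming added little perceived-latency benefit for that request.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamTiming:
    ttft_s: Optional[float]  # seconds until the first chunk; None if the stream was empty
    total_s: float           # seconds until the stream was exhausted
    chunks: int              # number of chunks received

    @property
    def ttft_share(self) -> Optional[float]:
        """Fraction of the total time spent waiting for the first chunk."""
        if self.ttft_s is None or self.total_s == 0:
            return None
        return self.ttft_s / self.total_s


def time_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a chunk stream and record time-to-first-token and total time."""
    start = clock()
    first: Optional[float] = None
    count = 0
    for _ in chunks:
        if first is None:
            first = clock() - start  # first chunk arrived
        count += 1
    return StreamTiming(ttft_s=first, total_s=clock() - start, chunks=count)
```

The `clock` parameter exists only so the timing can be replaced with a fake clock in tests; in production the default `time.perf_counter` is used.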
Journey Context:
Streaming (SSE) improves UX by lowering perceived latency, but it carries hidden costs. First, streaming encourages longer outputs: users watch text appear and don't interrupt, whereas batch (non-streamed) responses feel slower, so users accept more concise answers. Second, some providers may bill generated tokens differently in streaming and batch modes, so check the pricing for each mode. Third, there is API overhead: each SSE chunk carries HTTP framing overhead, and naive client implementations buffer and re-process chunks, then retry on perceived timeouts, with each retry billed as a new request. The killer: OpenAI's Batch API offers a 50% discount but returns results within a 24-hour window rather than immediately, and many teams can't use it because a synchronous architecture can't easily switch to async. The fix requires architectural bifurcation: a real-time path (streaming, expensive) and a delayed path (batch, cheap), each with a clear SLA.
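The bifurcation described above can be sketched as a small routing function plus a cost estimate. This is an illustration under stated assumptions, not a definitive design: `Request`, `choose_path`, and `estimated_cost` are hypothetical names, the 100-token threshold and 50% discount come from this report, and any per-token price passed in is a placeholder rather than a real provider rate.

```python
from dataclasses import dataclass

BATCH_DISCOUNT = 0.50        # from the report: batch is billed at 50% off
SHORT_OUTPUT_TOKENS = 100    # from the report: below this, skip streaming
BATCH_WINDOW_S = 24 * 3600   # from the report: batch results arrive within 24h


@dataclass(frozen=True)
class Request:
    expected_output_tokens: int
    max_wait_s: float  # SLA: how long the caller can wait for the result


def choose_path(req: Request) -> str:
    """Pick the cheapest path that still meets the request's SLA."""
    if req.max_wait_s >= BATCH_WINDOW_S:
        return "batch"   # delayed path: cheap, asynchronous
    if req.expected_output_tokens < SHORT_OUTPUT_TOKENS:
        return "sync"    # short answer: streaming adds little
    return "stream"      # real-time path: long answer, user is waiting


def estimated_cost(tokens: int, price_per_1k: float, path: str, attempts: int = 1) -> float:
    """Estimate cost; `attempts` models client retries, each billed in full."""
    base = tokens / 1000 * price_per_1k
    if path == "batch":
        return base * (1 - BATCH_DISCOUNT)
    return base * attempts
```

For example, with a placeholder price of 2.0 per 1,000 tokens, `estimated_cost(10_000, 2.0, "stream", attempts=2)` gives 40.0, while the same workload on the batch path gives 10.0. The gap comes from the retry doubling the real-time cost and the discount halving the batch cost.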
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:54:49.720471+00:00— report_created — created