Report #45892
[cost\_intel] Streaming increases effective token costs for backend processing
Disable streaming for backend-to-backend calls where latency is irrelevant, and batch multiple small requests into a single completion call to minimize per-request overhead and token padding.
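A minimal sketch of the batching idea, assuming the answers can be returned as a numbered list. The helper names and the `complete` callable are hypothetical: `complete` stands in for whatever non-streaming completion call the backend already uses, and is not a real client API.

```python
from typing import Callable, List


def build_batched_prompt(items: List[str]) -> str:
    """Pack several small inputs into one numbered prompt."""
    lines = ["Answer each numbered item on its own line, prefixed with its number and a period."]
    lines += [f"{i}. {item}" for i, item in enumerate(items, start=1)]
    return "\n".join(lines)


def parse_batched_reply(reply: str, expected: int) -> List[str]:
    """Split a numbered reply into per-item answers; missing items stay empty."""
    answers = [""] * expected
    for line in reply.splitlines():
        head, sep, rest = line.partition(".")
        if sep and head.strip().isdigit():
            idx = int(head.strip())
            if 1 <= idx <= expected:
                answers[idx - 1] = rest.strip()
    return answers


def run_batch(items: List[str], complete: Callable[[str], str]) -> List[str]:
    # `complete` is a hypothetical stand-in for one non-streaming completion call.
    reply = complete(build_batched_prompt(items))
    return parse_batched_reply(reply, len(items))
```

Because the model may skip or reorder items, the parser keys on the item number instead of line position. Callers would still want to retry any item that comes back empty.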
Journey Context:
Streaming \(SSE\) is essential for interactive UX but carries hidden costs: \(1\) Per-chunk overhead can lead to suboptimal token batching in the inference engine and raise the total tokens billed compared with non-streaming mode. \(2\) Early cancellation: when a user stops mid-stream, tokens that were already generated and sent are still billed even if they were never read. \(3\) Long-lived connections add overhead on load balancers. For backend processing \(data enrichment, background jobs\), nothing consumes partial output, so streaming adds an estimated 10-20% latency for zero benefit.
Batch processing: Grouping 10 small prompts \(50 tokens each\) into one 500-token call is cheaper than 10 separate calls, because per-request overhead charges and the system prompt are paid once instead of ten times. Order-of-magnitude: 10 separate calls with 100-token prompts and a 50-token system prompt = 10 \* \(100 \+ 50\) = 1500 input tokens; 1 batched call = 10 \* 100 \+ 50 = 1050 input tokens \(30% savings\).
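The order-of-magnitude arithmetic above can be checked with a short sketch. The function names are illustrative, and the model counts input tokens only: it ignores output tokens, any per-request fees, and the few extra tokens a batched prompt needs to delimit its items.

```python
def input_tokens_separate(n_prompts: int, prompt_tokens: int, system_tokens: int) -> int:
    """Each call resends the system prompt alongside one user prompt."""
    return n_prompts * (prompt_tokens + system_tokens)


def input_tokens_batched(n_prompts: int, prompt_tokens: int, system_tokens: int) -> int:
    """One call carries every user prompt and a single system prompt."""
    return n_prompts * prompt_tokens + system_tokens


separate = input_tokens_separate(10, 100, 50)
batched = input_tokens_batched(10, 100, 50)
savings = (separate - batched) / separate

print(separate, batched, f"{savings:.0%}")  # prints: 1500 1050 30%
```

The savings grow with the ratio of system-prompt tokens to per-item tokens. With a long system prompt and very short items, most of the input cost of separate calls is the repeated system prompt.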
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:30:21.933028+00:00— report_created — created