Report #66174
[cost_intel] Streaming tokens appear cheaper but incur hidden overhead from connection keep-alive and partial chunk processing
Use the batch API for workloads above 1,000 requests/day that can tolerate asynchronous results; disable streaming for short, deterministic outputs; and budget 15-20% overhead in streaming cost models.
Journey Context:
Streaming improves perceived latency but requires persistent connections (HTTP/2 keep-alive) and client-side buffering. Providers meter streamed and non-streamed tokens the same way, so the difference lies in infrastructure cost, not in the token bill. More critically, streaming encourages token-by-token processing patterns in which clients re-process or buffer partial output excessively. For high-volume workloads, batch APIs (OpenAI Batch API, Anthropic Message Batches) offer a 50% discount for the same token count, in exchange for asynchronous delivery. Alternatives: synchronous (non-streaming) calls for outputs under 500 tokens, which avoid the streaming overhead, and batch for bulk jobs. The 15-20% overhead figure covers connection costs and client processing time, which translates into compute cost in serverless environments billed by execution duration.
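The guidance above can be turned into a rough cost model. The following is a minimal Python sketch, not a provider API: the names `Workload`, `choose_mode`, and `effective_cost` are hypothetical, and the `latency_tolerant` flag is an added assumption (batch results arrive asynchronously, so interactive traffic cannot use them). The thresholds (1,000 requests/day, 500 output tokens), the 50% batch discount, and the 15-20% streaming overhead are taken from this report and should be checked against your own measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    requests_per_day: int
    avg_output_tokens: int
    # Assumption (not in the report): True if callers can wait for an
    # asynchronous batch job instead of an immediate response.
    latency_tolerant: bool


def choose_mode(w: Workload) -> str:
    """Pick a call mode using the report's rule-of-thumb thresholds."""
    if w.latency_tolerant and w.requests_per_day > 1000:
        return "batch"
    if w.avg_output_tokens < 500:
        return "sync"  # short output: streaming adds overhead for little gain
    return "stream"


def effective_cost(token_cost: float, mode: str,
                   streaming_overhead: float = 0.20) -> float:
    """Adjust a raw token bill for the chosen mode.

    token_cost is the bill at standard per-token prices.
    streaming_overhead is the report's 15-20% estimate (0.15 to 0.20).
    """
    if mode == "batch":
        return token_cost * 0.5  # 50% batch discount cited in the report
    if mode == "stream":
        return token_cost * (1.0 + streaming_overhead)
    return token_cost  # synchronous: no adjustment in this model


if __name__ == "__main__":
    bulk = Workload(requests_per_day=5000, avg_output_tokens=200,
                    latency_tolerant=True)
    mode = choose_mode(bulk)
    print(mode, effective_cost(100.0, mode))
```

Under these assumptions, a workload that moves from streaming to batch pays half the token bill and also sheds the modeled streaming overhead, so the gap between the two modes is wider than the headline discount alone suggests.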
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:33:21.174542+00:00— report_created — created