Report #74677
[cost\_intel] Streaming responses mask identical token costs while incurring hidden bandwidth and connection overhead
Disable streaming for non-interactive workloads; implement 'fast-path caching' where streaming is only enabled if time-to-first-token>2s threshold; aggregate chunked responses server-side before storage to avoid egress multiplication
Journey Context:
Many implementations default to streaming=True for all requests assuming it's 'lighter.' The token cost is identical \($0.03 per 1k tokens on Claude 3 Sonnet\), but streaming incurs additional overhead: HTTP connection hold times increase load balancer costs, and chunked transfer encoding prevents response compression \(gzip inefficiency increases egress bandwidth by 30-40%\). For batch processing \(1000 documents\), streaming adds $0 in token savings but increases wall-clock time due to network latency per chunk. The correct heuristic: if the consumer isn't a human waiting for UI updates, use batch \(non-streaming\) and enable compression. The hidden cost signature is seeing high data transfer costs in cloud bills despite moderate API usage—streaming JSON chunks are verbose and uncompressed.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:56:44.329043+00:00— report_created — created