Report #73859
[cost_intel] Streaming overhead exceeds batch latency savings at high throughput
Disable streaming (stream=false) for backend-to-backend calls and reserve it for user-facing UX. If connection setup cost is the concern, use HTTP/2 multiplexing with connection pooling to avoid repeated TCP handshakes rather than treating streaming as a latency optimization.
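A minimal sketch of that recommendation, assuming a JSON request body with a `stream` flag and the third-party `httpx` library installed with its HTTP/2 extra (`pip install "httpx[http2]"`). The names `build_payload`, `make_pooled_client`, and the `prompt` field are hypothetical and not taken from any specific API.

```python
def build_payload(prompt: str, user_facing: bool) -> dict:
    """Build a request body that streams only when a human is watching.

    The 'prompt' and 'stream' field names are assumptions for illustration.
    """
    return {"prompt": prompt, "stream": user_facing}


def make_pooled_client(base_url: str, max_connections: int = 20):
    """Return one long-lived HTTP/2 client to share across backend calls.

    httpx is imported lazily so the rest of this sketch runs without it.
    Reusing the client keeps connections open, so requests are multiplexed
    over existing connections instead of paying a new handshake each time.
    """
    import httpx  # assumption: installed with the http2 extra

    limits = httpx.Limits(
        max_connections=max_connections,
        max_keepalive_connections=max_connections,
    )
    return httpx.Client(base_url=base_url, http2=True, limits=limits)


if __name__ == "__main__":
    # Backend job: no streaming. Chat UI: streaming.
    print(build_payload("summarize this ticket", user_facing=False))
    print(build_payload("hello", user_facing=True))
```

The client should be created once per process and reused; building a new client per request would discard the pooled connections and reintroduce the handshake cost.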
Journey Context:
Teams often enable streaming for every request on the assumption that a lower Time-To-First-Byte (TTFB) means lower latency. For machine-to-machine communication that assumption does not hold: the client must accumulate and parse every chunk before it can act, so what matters is time to the complete response, not TTFB. Streaming adds per-event JSON line parsing and buffer management, and streamed responses are frequently sent without gzip or compress poorly because each chunk is flushed separately. At high throughput (>1000 req/s), the CPU cost of managing stream buffers plus the framing overhead of chunked delivery can add an estimated 15-30% effective latency compared to receiving one complete JSON blob. Reserve streaming for human-facing typewriter effects, not backend processing.
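The parsing cost described above can be checked locally with a rough micro-benchmark. This sketch assumes an SSE-style wire format (`data: {...}` lines ending with `data: [DONE]`) and hypothetical `delta` and `text` field names; it measures client-side parsing only, not network or compression effects, so the numbers it prints are not a substitute for measuring a real service.

```python
import json
import timeit


def make_stream_lines(tokens):
    """Encode each token as one 'data: {...}' line (assumed wire format)."""
    return [f"data: {json.dumps({'delta': t})}" for t in tokens] + ["data: [DONE]"]


def make_batch_blob(tokens):
    """Encode the same content as a single JSON document."""
    return json.dumps({"text": "".join(tokens)})


def consume_stream(lines):
    """Accumulate every chunk, as a backend caller must before using the result."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank or comment lines
        body = line[len("data: "):]
        if body == "[DONE]":
            break
        parts.append(json.loads(body)["delta"])  # one JSON parse per chunk
    return "".join(parts)


def consume_batch(blob):
    """Parse the complete response in a single call."""
    return json.loads(blob)["text"]


if __name__ == "__main__":
    tokens = [f"tok{i} " for i in range(500)]
    lines, blob = make_stream_lines(tokens), make_batch_blob(tokens)
    assert consume_stream(lines) == consume_batch(blob)
    print("stream:", timeit.timeit(lambda: consume_stream(lines), number=200))
    print("batch: ", timeit.timeit(lambda: consume_batch(blob), number=200))
```

Both paths produce the same string; the difference is how many parse calls and intermediate buffers it takes to get there.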
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:34:19.214454+00:00— report_created — created