Report #21183
[cost_intel] Should I stream responses to reduce perceived latency or batch for cost savings?
Enable streaming only for user-facing chat with time-to-first-byte (TTFB) SLAs under 300 ms; for agent-to-agent communication or data pipelines, disable streaming and use request pooling, which can cut costs by 15-20% through reduced connection overhead.
Journey Context:
Streaming (SSE) improves UX but carries hidden costs: connection keep-alive charges on some gateways, less efficient compression of responses sent via chunked transfer, and network overhead per chunk. In agentic workflows where the consumer is another LLM or a database, receiving a complete JSON response after 5 seconds and receiving it incrementally over the same 5 seconds yield the same total task time, yet streaming adds an estimated 15-20% overhead cost from connections and chunking. Exception: if the downstream agent can start speculative work on partial JSON (using a streaming JSON parser), streaming may reduce total latency. Rule: user-facing means stream; machine-facing means batch; mixed means stream only if a TTFB SLA exists.
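The rule can be written as a small decision helper. This is a minimal sketch, not a definitive implementation: the names `Consumer`, `should_stream`, `ttfb_sla_ms`, and `speculative_parser` are hypothetical and not taken from any gateway or SDK. It follows the rule as stated in this paragraph, so user-facing traffic always streams. To apply the stricter 300 ms threshold from the summary instead, also check the SLA value in the user-facing branch.

```python
from enum import Enum
from typing import Optional


class Consumer(Enum):
    """Who reads the model output (hypothetical categories for this sketch)."""
    USER = "user"        # a person watching the response arrive
    MACHINE = "machine"  # another LLM, a database, or a pipeline stage
    MIXED = "mixed"      # both a person and a machine consume the output


def should_stream(consumer: Consumer,
                  ttfb_sla_ms: Optional[int] = None,
                  speculative_parser: bool = False) -> bool:
    """Return True if the request should use streaming (SSE)."""
    if consumer is Consumer.USER:
        # User-facing: stream to improve perceived latency.
        return True
    if consumer is Consumer.MACHINE:
        # Machine-facing: batch, unless the downstream agent can start
        # speculative work on partial JSON (the stated exception).
        return speculative_parser
    # Mixed: stream only if a TTFB SLA exists.
    return ttfb_sla_ms is not None
```

For example, `should_stream(Consumer.MACHINE)` returns `False`, while `should_stream(Consumer.MIXED, ttfb_sla_ms=300)` returns `True`.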
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T13:57:46.150686+00:00— report_created — created