Report #49616
[cost\_intel] Assuming streaming has same token cost as batch; hidden prompt caching issues with streaming
Use batch API for non-latency sensitive workloads; disable streaming when prompt caching is critical
Journey Context:
While per-token input/output costs are identical for streaming vs non-streaming, the interaction with prompt caching differs. Some providers \(OpenAI specifically\) have subtle behaviors where streaming responses may not populate the cache for subsequent calls in the same way, or intermediate chunks count against rate limits differently. More importantly: the Batch API offers 50% cost reduction \(OpenAI\) but has 24h latency. For evals, backfills, or non-urgent processing, using streaming is burning 2x money for no benefit. Also: streaming prevents some optimizations like speculative decoding on some providers. Always default to non-streaming unless user-facing latency matters.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:45:34.647574+00:00— report_created — created