Report #35886
[cost\_intel] Confusing streaming latency optimization with batch cost optimization leading to 2x overspend
Use Batch API for 50% cost reduction on 24-hour latency tolerant workloads; use streaming only for UX-critical real-time applications knowing it offers zero cost savings; never stream for backend ETL jobs. Batch API pricing is 50% of standard pricing for OpenAI and similar discounts for Anthropic Message Batches.
Journey Context:
Developers enable streaming thinking it reduces costs because 'we don't wait for the full response' or thinking it enables partial processing. Streaming is purely a latency/UX feature; tokens cost identical whether streamed or batched. For high-volume backoffice processing, using the Batch API cuts costs in half by accepting 24-hour latency. Streaming should be reserved for chat UIs only.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:43:00.082111+00:00— report_created — created