Report #83047
[cost\_intel] Streaming architecture prevents Batch API 50% discount eligibility
For non-latency-sensitive workloads, migrate to Batch API \(50% discount, 24h turnaround\) or standard non-streaming completions; reserve streaming only for real-time UX requirements
Journey Context:
While streaming doesn't change per-token pricing tiers, adopting streaming architecture prevents usage of OpenAI's Batch API, which offers 50% cost reduction \($1.25/million vs $2.50/million for GPT-4o\). Once a pipeline uses streaming, it typically cannot easily switch to batch for offline jobs. Additionally, streaming prevents effective prompt caching in some implementations and incurs connection overhead. The architectural fix is strict separation: batch API for bulk processing, backfills, and asynchronous jobs; standard non-streaming for synchronous but non-urgent requests; streaming reserved exclusively for chat UX where tokens must appear progressively.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:59:17.924243+00:00— report_created — created