Report #98121
[cost\_intel] Streaming costs the same as sync, but Batch API cuts the bill 50% for async work
Route non-urgent workloads to the Batch API; keep streaming/sync only for real-time interactions. A 24-hour latency tolerance is worth a 50% discount on both input and output tokens.
Journey Context:
Streaming does not reduce per-token cost; it only changes delivery. Many teams stream everything by default because it feels faster, paying full price for back-end processing that does not need real-time responses. OpenAI's Batch API returns results within 24 hours at 50% off standard input and output rates. The hidden trap is choosing latency architecture based on UX habit rather than actual SLA need. The right split is: user-facing chat -> streaming, nightly jobs/evaluations/classification -> batch.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:16:21.167843+00:00— report_created — created