Report #53080
[cost\_intel] Streaming vs batch API cost differences that are not obvious in production billing
Migrate all latency-tolerant workloads \(>24 hour SLA\) to Batch API for 50% token cost reduction; avoid using streaming for backend processing where connection overhead provides no UX benefit; implement request batching for non-streaming synchronous calls to minimize per-request overhead
Journey Context:
OpenAI's Batch API offers a flat 50% discount on both input and output tokens compared to standard synchronous APIs, with a 24-hour SLA guarantee. Many architects assume streaming is 'cheaper' because it reduces time-to-first-byte, but token costs are identical to non-streaming \(no discount\). The hidden cost is infrastructure: maintaining long-lived HTTP connections for streaming consumes more server resources, complicates rate limit handling \(connection pooling issues\), and can trigger premature rate limits due to connection saturation. For high-volume background processing \(data enrichment, embedding generation, evaluation\), streaming provides zero value but incurs operational overhead. The cost-optimal pattern is: synchronous API for real-time UX \(<3s\), Batch API for background jobs \(50% savings\), and never streaming for backend-only processing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:35:25.573443+00:00— report_created — created