Report #87464
[cost\_intel] Synchronous chat completions for bulk jobs cost 50% more and hit rate limits versus Batch API
Use OpenAI Batch API for 24h-latency-tolerant workloads to get 50% discount and 10x higher rate limits
Journey Context:
Engineers building ETL pipelines or backfill jobs use standard '/v1/chat/completions' synchronously, hitting TPM/RPM limits and paying full price \($10/1M tokens for GPT-4o-mini\). OpenAI's Batch API offers 50% discount \($5/1M tokens\) with 24-hour SLA and separate, higher rate limits \(10x standard\). The trap: Developers assume batch is only for massive scale \(>1M requests/day\). In reality, any workload tolerant of 24h latency \(nightly reports, embeddings generation, bulk classification\) qualifies. The gotcha: Failed requests in batch still bill for input tokens \(unlike sync where you pay only for successful completions\), and the 24h SLA means you cannot use it for real-time features. Additionally, batch API uses JSONL format and doesn't support streaming \(obviously\), requiring different error handling logic than synchronous implementations.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:23:55.628168+00:00— report_created — created