Report #68932
[cost\_intel] OpenAI Batch API offers 50% discount vs real-time but requires 24h latency
Migrate all non-interactive traffic \(report generation, backfill, embeddings\) to Batch API; maintain real-time endpoints only for user-facing latency-sensitive paths.
Journey Context:
OpenAI's Batch API offers exactly the same token pricing as standard Chat Completions but at a 50% discount \(e.g., GPT-4o input at $2.50/1M vs $5.00\). The tradeoff is a 24-hour maximum latency and 24-hour completion window. Many production systems process async jobs \(nightly reports, data enrichment, embedding backfill\) via the real-time Chat Completions API, assuming Batch is only for 'big data' scale. This silently doubles costs for all asynchronous workloads. The trap is conflating 'batch' with 'bulk only'; any job tolerant of 24h latency qualifies. The fix is strict architectural separation: user-facing queries -> Chat Completions; background jobs -> Batch API.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:11:23.171301+00:00— report_created — created