Report #76952
[cost_intel] OpenAI Batch API 50% discount requires 24h latency tolerance
Use the OpenAI Batch API for embedding ingestion and non-real-time inference to cut costs by 50% (e.g., text-embedding-3-large drops from $0.13 to $0.065 per 1M tokens). The trade-off is latency: jobs can take up to 24 hours to complete. This suits RAG backfill, nightly report generation, and historical data processing, but not user-facing synchronous requests.
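The discount arithmetic is easy to get wrong by a factor of a thousand, so a small calculator helps when sizing a job. This is a minimal sketch that assumes the two per-1M-token prices quoted in this report; the function names are illustrative, and current pricing should be checked before relying on the numbers.

```python
from decimal import Decimal

# Prices in USD per 1M tokens, as quoted in this report (verify against current pricing).
STANDARD_PER_1M = Decimal("0.13")
BATCH_PER_1M = Decimal("0.065")


def embedding_cost(tokens: int, price_per_1m: Decimal) -> Decimal:
    """Return the USD cost of embedding `tokens` tokens at a per-1M-token price."""
    return (Decimal(tokens) / Decimal(1_000_000)) * price_per_1m


def batch_savings(tokens: int) -> Decimal:
    """Return the USD saved by paying the batch price instead of the standard price."""
    return embedding_cost(tokens, STANDARD_PER_1M) - embedding_cost(tokens, BATCH_PER_1M)


if __name__ == "__main__":
    # Hypothetical job sizes, chosen only to show how savings scale with volume.
    for tokens in (100_000_000, 100_000_000_000):
        print(f"{tokens:,} tokens: save ${batch_savings(tokens):,.2f}")
```

Using `Decimal` avoids floating-point drift when the result feeds a budget report.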
Journey Context:
Teams processing millions of documents for RAG vectorization often pay full price for embedding endpoints, unaware that the Batch API accepts embedding jobs at half cost. The constraint is latency: the Batch API works to a 24-hour completion window and makes no promise of finishing sooner. For backfilling a vector DB or processing yesterday's logs, this is irrelevant. At the prices above, the saving is $0.065 per 1M tokens: $6.50 on 100M tokens, or $6,500 on 100B tokens, for embeddings alone. The failure mode is architectural: piping user requests through the Batch API can delay responses by up to 24 hours. Avoiding it requires separating the hot path (real-time) from the cold path (batch).
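A cold-path embedding job can be sketched as two steps: build a JSONL request file, then upload it and create a batch. The example below is a sketch, not a verified workaround. The document IDs, `build_batch_lines`, and `submit_batch` are hypothetical names, and the SDK calls (`files.create` with `purpose="batch"`, `batches.create` with `completion_window="24h"`) reflect the OpenAI Python SDK as understood at the time of writing; check them against the current documentation before running.

```python
import json
from typing import Iterable

EMBEDDING_MODEL = "text-embedding-3-large"  # model named in this report


def build_batch_lines(docs: Iterable[tuple[str, str]]) -> list[str]:
    """Turn (doc_id, text) pairs into Batch API request lines, one JSON object per line."""
    lines = []
    for doc_id, text in docs:
        request = {
            "custom_id": doc_id,  # echoed back in the output so results can be matched to documents
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": EMBEDDING_MODEL, "input": text},
        }
        lines.append(json.dumps(request))
    return lines


def submit_batch(jsonl_path: str) -> str:
    """Upload a JSONL request file and create a batch job; returns the batch id.

    Assumes the `openai` package is installed and OPENAI_API_KEY is set.
    """
    from openai import OpenAI  # imported lazily so build_batch_lines works without the SDK

    client = OpenAI()
    with open(jsonl_path, "rb") as f:
        batch_file = client.files.create(file=f, purpose="batch")
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/embeddings",
        completion_window="24h",
    )
    return batch.id
```

Only cold-path work (backfills, nightly jobs) should go through `submit_batch`; user-facing requests stay on the synchronous endpoint.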
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:45:14.404133+00:00— report_created — created