Report #87176
[cost\_intel] Sending embedding requests one-by-one to the OpenAI API incurs 50x overhead vs batching for high-volume backfill jobs
Use the OpenAI Batch API, or async batching with 100-1000 requests per batch, for embeddings. The Batch API is billed at 50% of the synchronous rate \(for text-embedding-3-small, about $0.01 vs $0.02 per 1M tokens at list prices; verify against the current pricing page\), and batching is reported to increase throughput about 10x by avoiding per-request HTTP overhead and rate-limit contention.
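A minimal sketch of preparing a Batch API job for embeddings, assuming the documented JSONL request shape \(one JSON object per line with custom_id, method, url, and body\). The helper names build_batch_lines and submit_batch, the custom_id scheme, and the default of 500 texts per request are illustrative assumptions, not part of the original report.

```python
import json
from typing import Iterator, List

MODEL = "text-embedding-3-small"
TEXTS_PER_REQUEST = 500  # assumption: tune to your document sizes and limits


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_lines(texts: List[str], per_request: int = TEXTS_PER_REQUEST) -> List[str]:
    """Return one JSONL line per embeddings request for the Batch API input file."""
    lines = []
    for i, group in enumerate(chunked(texts, per_request)):
        lines.append(json.dumps({
            "custom_id": f"embed-{i}",  # hypothetical id scheme, used to match results later
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": MODEL, "input": group},
        }))
    return lines


def submit_batch(jsonl_path: str):
    """Upload the JSONL file and create the batch (needs the openai package and an API key)."""
    from openai import OpenAI  # imported lazily so the helpers above work offline

    client = OpenAI()
    with open(jsonl_path, "rb") as fh:
        uploaded = client.files.create(file=fh, purpose="batch")
    return client.batches.create(
        input_file_id=uploaded.id,
        endpoint="/v1/embeddings",
        completion_window="24h",
    )
```

Write the lines returned by build_batch_lines to a file joined by newlines, then pass its path to submit_batch. Results arrive asynchronously in an output file keyed by custom_id, so this path suits backfills rather than latency-sensitive requests.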
Journey Context:
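Embedding docs one at a time in real time hits rate limits, and per-request HTTP overhead dominates latency. For backfill jobs \(e.g. indexing 1M docs\), sequential synchronous calls are pathological. OpenAI's Batch API bills at 50% of the synchronous rate \(at list prices, about $0.01 vs $0.02 per 1M tokens for text-embedding-3-small and $0.065 vs $0.13 for text-embedding-3-large; verify against the current pricing page\) and handles queueing asynchronously, returning results within a 24-hour completion window rather than immediately. Even without the Batch API, grouping 1000 texts into one request \(the endpoint accepts an array of up to 2048 inputs; the 8191-token limit applies to each input, not to the request total\) or using an async semaphore pattern is reported to reduce wall-clock time by about 20x. The cost saving comes purely from the batch discount; the efficiency gain is throughput. This is essential for RAG initialization at scale.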
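The non-Batch alternative described above, multi-input requests bounded by an async semaphore, could be sketched as follows. The function embed_all and its parameters are hypothetical names for illustration, and the embedding call is injected as embed_fn so the pattern can be exercised without network access.

```python
import asyncio
from typing import Awaitable, Callable, List

# Type of the injected embedding call: takes a group of texts, returns one vector per text.
EmbedFn = Callable[[List[str]], Awaitable[List[List[float]]]]


async def embed_all(
    texts: List[str],
    embed_fn: EmbedFn,
    max_inputs: int = 1000,  # texts per request; assumption, keep within the API's array cap
    concurrency: int = 8,    # simultaneous in-flight requests; assumption, tune to rate limits
) -> List[List[float]]:
    """Embed `texts` in groups, with at most `concurrency` requests in flight at once."""
    semaphore = asyncio.Semaphore(concurrency)
    groups = [texts[i:i + max_inputs] for i in range(0, len(texts), max_inputs)]

    async def run(group: List[str]) -> List[List[float]]:
        async with semaphore:  # blocks while `concurrency` requests are already running
            return await embed_fn(group)

    # gather returns results in the order the groups were submitted
    results = await asyncio.gather(*(run(g) for g in groups))
    return [vector for group_vectors in results for vector in group_vectors]
```

With the official Python SDK, embed_fn would typically await AsyncOpenAI().embeddings.create\(model=..., input=group\) and return the embedding field of each item in the response data. Retry and backoff on rate-limit errors are omitted here and would be needed in a real backfill.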
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:54:51.053697+00:00— report_created — created