Report #87879
[cost\_intel] Processing embedding requests serially for RAG ingestion instead of batching
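Batch embedding requests so that each API call carries many documents: Cohere accepts up to 96 texts per call, and OpenAI accepts up to 2048 inputs per request, subject to per-input and per-request token caps \(verify both against current provider documentation\). Batching makes TPM rather than RPM the binding limit and cuts ingestion time by up to roughly the batch size \(about 1000 minutes down to about 10 minutes for 1M documents at 1000 RPM with batches of 100, TPM permitting\). The per-token price is unchanged, but infrastructure costs at scale drop because ingestion workers run for far less time than with serial processing.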
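A minimal sketch of the batching loop described above, assuming the provider call is wrapped in a function that takes a list of texts and returns one vector per text. The names `EmbedFn`, `chunked`, `embed_in_batches`, and `embed_batch` are hypothetical and introduced only for illustration; the default batch size of 96 is an assumption that should be set to whatever the provider actually allows.

```python
from typing import Callable, Iterator, List, Sequence

# Hypothetical type: any function that embeds a list of texts in ONE API call
# and returns one vector per input text, in the same order.
EmbedFn = Callable[[List[str]], List[List[float]]]


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: EmbedFn,
    batch_size: int = 96,  # assumption: adjust to the provider's documented limit
) -> List[List[float]]:
    """Embed `texts` with one call per batch instead of one call per text."""
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        result = embed_batch(batch)
        # Guard against a partial response so vectors stay aligned with texts.
        if len(result) != len(batch):
            raise RuntimeError("provider returned an unexpected number of vectors")
        vectors.extend(result)
    return vectors
```

With the OpenAI Python SDK, `embed_batch` could plausibly wrap `client.embeddings.create(model=..., input=batch)` and return `[d.embedding for d in response.data]`; retries and rate-limit backoff are left out of this sketch.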
Journey Context:
Developers often write RAG pipelines that loop over documents and call createEmbedding once per document, so they hit the requests-per-minute \(RPM\) limit long before the tokens-per-minute \(TPM\) limit. Embedding endpoints accept an array of inputs: OpenAI allows up to 2048 inputs per request \(each up to 8191 tokens on its current embedding models\), and Cohere allows up to 96 texts per call; check the provider's current documentation before hardcoding either number. Sending 100 documents per request uses 1/100th of the RPM quota for the same token volume, so throughput can rise until TPM becomes the constraint. Per-document latency also drops sharply, because network round-trip overhead dominates small requests. At 1M documents and 1000 RPM, serial processing takes ~1000 minutes, while batches of 100 need only 10,000 requests, or ~10 minutes if the TPM quota permits. The price per token is identical, but the effective cost of compute time and infrastructure is lower because ingestion workers run for far less time. Some providers \(like Cohere, whose embed-v4 batch API is reported to be cheaper\) also offer actual discounts for batching; confirm against current pricing.
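The timing figures above follow from simple rate-limit arithmetic, sketched here. The function name `ingestion_minutes` and its parameters are hypothetical, and the model assumes requests are issued at exactly the quota rate with no retries or network stalls.

```python
import math
from typing import Optional


def ingestion_minutes(
    num_docs: int,
    batch_size: int,
    rpm: int,
    tokens_per_doc: Optional[int] = None,
    tpm: Optional[int] = None,
) -> float:
    """Estimate minutes to embed `num_docs` under RPM and (optionally) TPM limits."""
    requests = math.ceil(num_docs / batch_size)
    minutes_by_rpm = requests / rpm
    if tokens_per_doc is None or tpm is None:
        return minutes_by_rpm
    # Total tokens are the same regardless of batch size, so the TPM bound
    # does not shrink when batching; it becomes the floor on ingestion time.
    minutes_by_tpm = num_docs * tokens_per_doc / tpm
    return max(minutes_by_rpm, minutes_by_tpm)


serial = ingestion_minutes(1_000_000, batch_size=1, rpm=1000)     # 1000.0
batched = ingestion_minutes(1_000_000, batch_size=100, rpm=1000)  # 10.0
```

Passing `tokens_per_doc` and `tpm` shows when batching stops helping: once the TPM bound exceeds the RPM bound, larger batches no longer shorten the run.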
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:05:27.447984+00:00— report_created — created