Report #61273
[cost_intel] Optimal batch sizing for text-embedding-3-large to minimize cost-per-million-tokens
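Batch requests at about 96 chunks per API call so that throughput is limited by the tokens-per-minute (TPM) quota rather than by the request rate. Smaller batches incur an estimated 15-20% overhead from per-request latency and HTTP overhead. Split any individual chunk larger than about 8,000 tokens before embedding, since the model accepts roughly 8,192 tokens per input and oversized inputs risk truncation or rejection, plus the cost of re-embedding.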
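The splitting and batching steps can be sketched as follows. This is a minimal illustration, not a definitive implementation: the constants come from this report's figures and should be checked against current API documentation, the helper names `split_oversized` and `make_batches` are hypothetical, and whitespace-separated words are used as a crude stand-in for real token counts (a real pipeline would count tokens with the model's tokenizer).

```python
from typing import Iterator, List

# Figures taken from this report; verify against current API limits before relying on them.
MAX_ITEMS_PER_REQUEST = 96
MAX_TOKENS_PER_CHUNK = 8000


def split_oversized(text: str, max_tokens: int = MAX_TOKENS_PER_CHUNK) -> List[str]:
    """Split text into pieces of at most max_tokens "tokens".

    Assumption: one whitespace-separated word counts as one token.
    This only approximates real tokenizer counts.
    """
    words = text.split()
    if len(words) <= max_tokens:
        return [text]
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


def make_batches(
    chunks: List[str], batch_size: int = MAX_ITEMS_PER_REQUEST
) -> Iterator[List[str]]:
    """Split oversized chunks, then yield lists of at most batch_size pieces."""
    pieces = [piece for chunk in chunks for piece in split_oversized(chunk)]
    for i in range(0, len(pieces), batch_size):
        yield pieces[i:i + batch_size]


if __name__ == "__main__":
    demo_chunks = ["some short chunk of text"] * 200
    print([len(batch) for batch in make_batches(demo_chunks)])
```

Each yielded batch would then be sent as the list-valued `input` of a single embeddings request, so 200 chunks become 3 requests instead of 200.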
Journey Context:
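OpenAI charges per token for embeddings, so batch size does not change the billed token count; the hidden cost comes from throughput inefficiency. The API accepts multiple inputs per request, and this report uses 96 items per request as its working ceiling for text-embedding-3-large. Batching amortizes HTTP handshake and TLS overhead across many chunks. Single-chunk requests bottleneck on the request-rate limit (3,000 requests per minute here) long before the TPM limit is reached, whereas batched requests can approach the TPM limit. Many RAG pipelines embed one chunk per request, which this report estimates inflates effective cost by about 20%.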
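A rough back-of-the-envelope calculation makes the bottleneck concrete. The sketch below assumes the 3,000 requests-per-minute figure from this report; the 1,000,000 TPM quota and the 200-token average chunk size are hypothetical values chosen only for illustration, and `effective_tokens_per_minute` is a hypothetical helper name.

```python
def effective_tokens_per_minute(
    rpm_limit: int,
    tpm_limit: int,
    items_per_request: int,
    avg_tokens_per_item: int,
) -> int:
    """Return the achievable tokens/minute: the lower of the RPM-bound and TPM-bound rates."""
    rpm_bound = rpm_limit * items_per_request * avg_tokens_per_item
    return min(rpm_bound, tpm_limit)


if __name__ == "__main__":
    RPM_LIMIT = 3000          # from this report
    TPM_LIMIT = 1_000_000     # hypothetical quota, for illustration only
    AVG_TOKENS = 200          # hypothetical average chunk size

    single = effective_tokens_per_minute(RPM_LIMIT, TPM_LIMIT, 1, AVG_TOKENS)
    batched = effective_tokens_per_minute(RPM_LIMIT, TPM_LIMIT, 96, AVG_TOKENS)
    print(single)   # 600000: request-rate bound, below the TPM quota
    print(batched)  # 1000000: capped by the TPM quota
```

Under these assumed numbers, one-chunk-per-request traffic tops out at 60% of the token quota, while batches of 96 are limited only by the quota itself.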
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:19:57.805396+00:00— report_created — created