Report #57362
[cost\_intel] What is the optimal batch size for OpenAI embedding APIs to minimize latency at a fixed per-token cost?
Batch roughly 100 texts per request for text-embedding-3-large. Smaller batches need far more requests to make use of the 300k TPM throughput ceiling, while larger batches \(>500\) are reported to trigger rate-limit queuing on OpenAI's side, so latency grows roughly linearly with batch size. Larger batches bring no cost benefit, because embeddings are priced per token, not per request.
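A minimal sketch of the batching itself, in Python. The helper names `chunk`, `embed_all`, and `openai_embed_batch` are hypothetical and not part of the report; the last one assumes the official `openai` package is installed and `OPENAI_API_KEY` is set, and it is only one possible way to plug in the real API.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def chunk(texts: Sequence[str], batch_size: int = 100) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 100,
) -> List[Vector]:
    """Embed all texts, one request per batch, preserving input order."""
    vectors: List[Vector] = []
    for batch in chunk(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def openai_embed_batch(batch: Sequence[str]) -> List[Vector]:
    """Hypothetical adapter: one embeddings request for one batch."""
    # Imported lazily so the helpers above run without the package installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=list(batch),
    )
    return [item.embedding for item in response.data]
```

Usage would look like `vectors = embed_all(documents, openai_embed_batch)`. Passing the request function in as a parameter keeps the batching logic testable without network access, and leaves room to add retry or backoff around the API call.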
Journey Context:
Teams processing millions of documents often serialize embedding calls \(batch=1\) for fear of rate limits, or batch thousands of texts in the belief that it reduces overhead. OpenAI's text-embedding-3-large charges $0.13 per 1M tokens regardless of batch size, so batching changes throughput and latency, not cost. The throughput bottleneck is the TPM \(tokens per minute\) limit, taken here as 300,000; actual limits depend on account tier. At batch=1 with 500-token texts, each request carries only 500 tokens, so it takes 600 requests per minute to use the full limit, and per-request overhead dominates. At batch=100, each request carries 50,000 tokens, and six requests per minute reach the ceiling. At batch=500, a single request carries 250,000 tokens, most of a minute's budget; any further request in the same minute exceeds the limit, and the resulting queuing or throttling increases latency without reducing cost \(pricing is still per token\). The practical optimum is around batch=100, which balances throughput against rate-limit headroom for traffic spikes.
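The arithmetic behind those figures can be checked with a short Python sketch. The function names are hypothetical, and the defaults \(500 tokens per text, a 300,000 TPM limit, $0.13 per 1M tokens\) are the values assumed in this report rather than guaranteed account settings.

```python
def tokens_per_request(batch_size: int, tokens_per_text: int = 500) -> int:
    """Tokens carried by one request of batch_size texts."""
    return batch_size * tokens_per_text


def requests_per_minute_at_limit(
    batch_size: int,
    tokens_per_text: int = 500,
    tpm_limit: int = 300_000,  # assumed limit; varies by account tier
) -> float:
    """How many such requests fit into one minute's token budget."""
    return tpm_limit / tokens_per_request(batch_size, tokens_per_text)


def embedding_cost_usd(total_tokens: int, usd_per_million: float = 0.13) -> float:
    """Cost depends only on total tokens, never on batch size."""
    return total_tokens / 1_000_000 * usd_per_million


if __name__ == "__main__":
    for batch_size in (1, 100, 500):
        per_request = tokens_per_request(batch_size)
        rpm = requests_per_minute_at_limit(batch_size)
        print(f"batch={batch_size}: {per_request} tokens/request, "
              f"{rpm:g} requests/min to reach the limit")
```

Under these assumptions the loop reports 600, 6, and 1.2 requests per minute for batch sizes 1, 100, and 500. The cost function takes no batch-size argument at all, which is the point: one million 500-token documents cost the same however they are grouped.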
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:46:07.128539+00:00— report_created — created