Report #64676
[cost\_intel] Maximizing embedding batch size by item count without considering TPM limits
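Calculate the per-minute item budget as \(TPM\_limit / avg\_tokens\_per\_text\), then size batches so that batch size times requests per minute stays within that budget. For texts averaging >500 tokens, reduce batches to ~100 items to avoid TPM throttling despite RPM headroom.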
Journey Context:
Engineers often try to maximize throughput by sending the maximum number of items per batch \(often 1000-2000 for OpenAI\). However, embedding APIs enforce two limits at once: Requests Per Minute \(RPM\) and Tokens Per Minute \(TPM\). For long documents \(>500 tokens on average\), a batch of 1000 items carries 500k tokens or more, so a handful of requests exhausts the TPM limit \(e.g., 1M TPM\) while the RPM limit is still far away. The result is rate limit errors \(429s\) and exponential backoff, which can push effective throughput below that of serial processing. Dividing your TPM limit by the average token count of your texts gives the number of items you can send per minute; the batch size is that budget divided by the number of requests you plan to send per minute. For long texts, use smaller batches \(~100\) and higher concurrency; for short texts \(<100 tokens\), maximize batch size \(1000\+\).
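The sizing rule above can be sketched in a few lines of Python. This is a minimal illustration, not a provider-specific implementation: the function names `max_batch_size` and `chunked`, the `max_items_per_request` cap of 1000, and the example limits are all assumptions chosen to match the figures in this report. Check your own account's TPM, RPM, and per-request limits before relying on the numbers.

```python
from typing import Iterator, List, Sequence


def max_batch_size(
    tpm_limit: int,
    avg_tokens_per_text: float,
    requests_per_minute: int,
    max_items_per_request: int = 1000,  # assumed cap; check your provider
) -> int:
    """Largest batch that keeps `requests_per_minute` requests under the TPM limit."""
    if avg_tokens_per_text <= 0 or requests_per_minute <= 0:
        raise ValueError("avg_tokens_per_text and requests_per_minute must be positive")
    # Token budget available to each request if they are spread over one minute.
    tokens_per_request = tpm_limit // requests_per_minute
    # Items that fit in that budget, based on the average text length.
    items = int(tokens_per_request // avg_tokens_per_text)
    # Always send at least one item, and never exceed the per-request item cap.
    return max(1, min(items, max_items_per_request))


def chunked(texts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` texts."""
    for start in range(0, len(texts), size):
        yield list(texts[start:start + size])


if __name__ == "__main__":
    # Long texts: 1M TPM, ~500 tokens each, 20 requests per minute.
    print(max_batch_size(1_000_000, 500, 20))
    # Short texts: same limits, ~50 tokens each.
    print(max_batch_size(1_000_000, 50, 20))
```

Because the calculation uses an average, batches containing unusually long texts can still exceed the per-request token budget. Counting tokens per text and leaving some headroom below the TPM limit is a safer variant of the same idea.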
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T15:02:47.575383+00:00— report_created — created