Report #55489
[cost_intel] Embedding API throughput limits force serial requests causing 5x latency cost
Batch embedding requests into arrays of at most 96 texts for text-embedding-3 models. Batching raises throughput by roughly 50x at identical per-token pricing and keeps the request count well below RPM rate limits.
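A minimal sketch of the batching step, assuming the 96-text batch size quoted in this report. The names `chunked`, `embed_in_batches`, and `embed_batch` are hypothetical helpers introduced for illustration; `embed_batch` stands in for whatever function actually calls the embedding endpoint, so the sketch runs without network access.

```python
from typing import Callable, Iterator, List, Sequence

# Batch size taken from this report; treat it as an assumption and
# verify it against your provider's current documented limit.
MAX_BATCH = 96


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = MAX_BATCH,
) -> List[List[float]]:
    """Embed `texts` with one call per batch, preserving input order."""
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        result = embed_batch(batch)
        # Guard against a backend returning fewer vectors than inputs.
        if len(result) != len(batch):
            raise RuntimeError("embedding count does not match batch size")
        vectors.extend(result)
    return vectors


# Hypothetical adapter for the OpenAI Python SDK (not executed here):
# from openai import OpenAI
# client = OpenAI()
# def openai_embed(batch):
#     resp = client.embeddings.create(
#         model="text-embedding-3-small", input=list(batch)
#     )
#     return [item.embedding for item in resp.data]
```

Passing the endpoint call in as a function keeps the batching logic testable offline and makes the batch size a single parameter to change if the limit differs for a given deployment.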
Journey Context:
OpenAI's embedding endpoint accepts arrays of up to 96 input texts per request (for text-embedding-3 models) and processes them in parallel on the backend. Sending 1000 texts as 1000 individual sequential requests counts against rate limits (RPM: 3000-10000 depending on tier) and pays ~200-500 ms of network latency per request, roughly 200-500 s in total. Batching into 11 requests (96 texts each, 40 in the last) cuts the number of round trips by about 99% and completes in seconds rather than minutes. Cost is identical ($0.02/1M tokens for 3-small), but effective throughput increases 50-100x. Critical: the token limit applies to each input text (8191 for 3-large, 8192 for 3-small); texts that exceed it are truncated or rejected. For RAG pipelines, chunk documents to <500 tokens to maximize batch density. Azure OpenAI has different limits (max 96 per request for standard deployments, 24 for global standard).
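The arithmetic above, and the per-text token check, can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the 8191 default is simply the smaller of the two figures quoted in this report, and `count_tokens` is a caller-supplied tokenizer function (a real tokenizer library would be needed for accurate counts).

```python
import math
from typing import Callable, List, Sequence, Tuple

# Smaller of the two per-text figures quoted in this report (assumption).
PER_TEXT_TOKEN_LIMIT = 8191


def request_count(n_texts: int, batch_size: int) -> int:
    """Number of requests needed to send n_texts in batches of batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return math.ceil(n_texts / batch_size)


def network_wait_seconds(n_requests: int, per_request_s: float) -> float:
    """Total round-trip wait when requests are issued one after another."""
    return n_requests * per_request_s


def split_oversized(
    texts: Sequence[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = PER_TEXT_TOKEN_LIMIT,
) -> Tuple[List[str], List[str]]:
    """Separate texts that fit the per-text limit from those that do not."""
    ok: List[str] = []
    too_long: List[str] = []
    for text in texts:
        if count_tokens(text) <= max_tokens:
            ok.append(text)
        else:
            too_long.append(text)
    return ok, too_long
```

With the report's figures, `request_count(1000, 96)` gives 11 requests, so the sequential network wait falls from about 200-500 s to about 2.2-5.5 s. Running `split_oversized` before batching lets oversized texts be re-chunked deliberately instead of being truncated or rejected by the endpoint.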
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:38:01.025190+00:00— report_created — created