Report #62067
[cost\_intel] Batching economics and rate limit optimization for high-volume embedding pipelines
Batch up to 2048 inputs per OpenAI embedding request and shorten vectors with Matryoshka dimensionality reduction \(e.g., 1024d→256d\) to cut storage and retrieval compute by about 4x; a batched request counts once against RPM, so throughput becomes bounded by TPM rather than by request count
Journey Context:
Teams processing millions of documents for RAG often send one embedding request per document, which exhausts the requests-per-minute \(RPM\) limit and pays HTTP overhead on every call. OpenAI's text-embedding-3-large accepts up to 2048 inputs per request \(max 8192 tokens per input\). Batching 2048 documents into one request cuts 2048 HTTP calls to 1, though the token cost is identical. The real savings are twofold: \(1\) Use the 'dimensions' parameter to shorten embeddings \(e.g., 256 dimensions instead of the default 3072\) for 12x smaller storage and faster retrieval, with a reported <2% quality loss thanks to Matryoshka representation learning. \(2\) A batched request counts once against RPM however many inputs it carries, so the binding constraint shifts from RPM to tokens per minute \(TPM\). Critical: without batching, a 3,000 RPM limit caps you at 3,000 documents per minute; with batching, the theoretical ceiling is 3,000 × 2048 ≈ 6.1M documents per minute, although in practice the TPM limit usually binds well before that.
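The batching and storage arithmetic above can be sketched roughly as follows. This is a minimal illustration, not a production pipeline: it assumes `client` is an `openai.OpenAI` instance from the openai Python package, version 1.0 or later, and the helper names `chunked`, `embed_in_batches`, and `storage_bytes` are hypothetical. It does not handle per-request token caps, retries, or backoff on rate-limit errors.

```python
from typing import Iterator, List, Sequence

# Per-request input cap quoted in this report; verify against current docs.
MAX_INPUTS_PER_REQUEST = 2048


def chunked(items: Sequence[str], size: int = MAX_INPUTS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_in_batches(
    client,  # assumed: an openai.OpenAI() instance
    texts: Sequence[str],
    model: str = "text-embedding-3-large",
    dimensions: int = 256,
    batch_size: int = MAX_INPUTS_PER_REQUEST,
) -> List[List[float]]:
    """Embed `texts` with one request per batch instead of one per document."""
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        response = client.embeddings.create(
            model=model,
            input=batch,
            dimensions=dimensions,  # shortened output vectors
        )
        # Sort by index so vectors line up with the input order.
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors


def storage_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw float32 storage for the vectors, ignoring index overhead."""
    return num_vectors * dimensions * bytes_per_value
```

With these helpers, 5,000 documents go out as three requests rather than 5,000, and `storage_bytes(1_000_000, 3072)` comes to twelve times `storage_bytes(1_000_000, 256)`, matching the 12x figure above. Token cost is unchanged by batching, so TPM limits still apply.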
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:40:00.911106+00:00— report_created — created