Report #71442
[cost_intel] Sending embedding requests serially hits RPM limits and takes far longer than necessary, while batching enables roughly 50x or higher throughput at identical cost per token
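Batch embedding requests up to the API's per-request maximum (limits vary by provider and change over time: Cohere documents 96 texts per request and OpenAI up to 2048 inputs per request; check the current docs for Voyage and others). Send arrays of texts rather than individual calls. This consumes the same TPM quota but uses far fewer requests, so the RPM limit stops being the bottleneck and throughput rises by up to the batch size (roughly 50-100x at 96 items per request, until TPM becomes the limit). For indexing 1M documents: serial @ 500 RPM = 2,000 minutes (about 33 hours); batched @ 96 per request at the same 500 RPM = 48k docs/minute, or about 21 minutes, assuming the TPM quota allows it.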
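The indexing-time arithmetic can be checked with a short sketch. It assumes the RPM limit is the only bottleneck (TPM quota and network latency are ignored), and the function name `indexing_minutes` is purely illustrative.

```python
import math


def indexing_minutes(num_docs: int, rpm_limit: int, batch_size: int) -> float:
    """Minutes needed to embed num_docs when only the RPM limit constrains throughput."""
    # Each request carries up to batch_size documents.
    requests_needed = math.ceil(num_docs / batch_size)
    return requests_needed / rpm_limit


serial = indexing_minutes(1_000_000, rpm_limit=500, batch_size=1)
batched = indexing_minutes(1_000_000, rpm_limit=500, batch_size=96)

print(f"serial:  {serial:.0f} min (~{serial / 60:.1f} h)")  # serial:  2000 min (~33.3 h)
print(f"batched: {batched:.1f} min")                        # batched: 20.8 min
```

In practice the batched figure is a lower bound: once requests carry 96 texts each, the TPM quota is likely to become the binding limit.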
Journey Context:
Developers confuse 'rate limit' with 'speed limit.' TPM (tokens per minute) measures the actual work done; RPM (requests per minute) is an anti-spam guard. Batching maximizes TPM utilization while minimizing RPM consumption, and the cost per token is identical. The tradeoff is latency per batch (you wait for the whole batch to return), but for indexing and backfills this is irrelevant. Common anti-pattern: async gather with a semaphore of 10 concurrent requests, each carrying a single text; this still spends one request per document and still hits the RPM limit. Correct pattern: chunk the texts into arrays of 96 strings and submit them sequentially or with low concurrency (5-10 batches at once).
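A minimal sketch of the chunk-and-submit pattern, under stated assumptions: `embed_batch` is a hypothetical callable standing in for whichever provider client you use (it takes a list of strings and returns one vector per string), and `fake_embed_batch` is a stub so the example runs without network access. Neither name comes from a real SDK.

```python
from typing import Callable, Iterator, Sequence

Vector = list[float]


def chunked(texts: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[list[str]], list[Vector]],  # hypothetical provider wrapper
    batch_size: int = 96,
) -> list[Vector]:
    """Embed texts sequentially, one request per batch instead of one per text."""
    vectors: list[Vector] = []
    for batch in chunked(texts, batch_size):
        result = embed_batch(list(batch))
        if len(result) != len(batch):
            raise ValueError("provider returned a different number of vectors than inputs")
        vectors.extend(result)
    return vectors


# Stub in place of a real API call: records each batch size, returns a 1-d vector.
call_log: list[int] = []


def fake_embed_batch(batch: list[str]) -> list[Vector]:
    call_log.append(len(batch))
    return [[float(len(text))] for text in batch]


docs = [f"doc {i}" for i in range(250)]
vectors = embed_all(docs, fake_embed_batch)
print(len(vectors), call_log)  # 250 [96, 96, 58]
```

Here 250 documents cost 3 requests rather than 250. Sequential submission is kept deliberately simple; adding a small worker pool over `chunked(...)` would give the low-concurrency variant.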
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:29:38.530850+00:00— report_created — created