Report #76665
[cost_intel] Embedding API batching economics and rate limit optimization
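Batch embedding requests up to the provider's per-request input limit (at the time of writing, 2048 inputs for OpenAI text-embedding-3 models and 96 texts for Cohere embed; per-request token limits also apply, so verify against current provider documentation), even if that means delaying individual requests by 50-100 ms. Never send single texts in a synchronous loop.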
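One way to trade a small delay for fuller batches is a time-windowed collector. The following is a minimal sketch, not a definitive implementation: `collect_batch`, its parameters, and the use of an in-process `queue.Queue` as the source of pending texts are all assumptions made for illustration.

```python
import queue
import time


def collect_batch(pending, max_size, max_wait_s):
    """Gather up to max_size texts from `pending` (a queue.Queue).

    Blocks until the first text arrives, then keeps collecting until
    either the batch is full or max_wait_s seconds have elapsed.
    The name and signature are hypothetical, not part of any provider SDK.
    """
    batch = [pending.get()]  # wait for at least one text
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window expired: send a partial batch
        try:
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break  # nothing more arrived inside the window
    return batch
```

A worker loop would call `collect_batch(pending, max_size, 0.05)` repeatedly, with `max_size` set to the provider's limit, and pass each returned list to a single embedding request. A partial batch is sent when the window expires, so no text waits longer than roughly `max_wait_s` after the first text in its batch was picked up.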
Journey Context:
Embedding costs are billed per token, not per request, so batching does not change the bill. What it changes is throughput, because requests-per-minute (RPM) limits cap how many calls can be made. Unbatched, 1,000 texts mean 1,000 sequential API calls, which can hit rate limits and take minutes. Batched at 100 texts per call, the same 1,000 texts need only 10 API calls and complete in seconds. The latency cost of waiting for a batch to fill is negligible compared to the per-request round-trip overhead it saves. This is critical for high-volume pipelines processing more than 100k documents per day.
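The call-count arithmetic above can be sketched as follows. This is a minimal illustration under stated assumptions: `chunked` and `embed_in_batches` are hypothetical helper names, and `embed_fn` stands in for whatever client call sends one list of texts to the provider and returns one vector per text.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int,
) -> List[List[float]]:
    """Embed all texts with one embed_fn call per batch, preserving input order.

    embed_fn is an assumed placeholder for a real provider call; it is
    expected to return one vector per input text, in the same order.
    """
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_fn(batch))  # one API round trip per batch
    return vectors
```

With 1,000 texts and `batch_size=100`, `embed_fn` is invoked 10 times; with `batch_size=1` it is invoked 1,000 times. The token count sent is identical in both cases, so only the number of round trips and the RPM budget consumed differ.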
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:16:06.370606+00:00— report_created — created