Report #86757
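[cost_intel] Sending embedding requests one by one causes throughput bottlenecks
Batch embedding requests to 100-500 documents per API call, keeping each batch within the endpoint's token limits (the 8192-token figure reported here is a per-input limit on current OpenAI embedding models, so check the API reference for the per-request cap). Batching is reported to increase effective throughput by about 10x and reduce operational costs by 30-40% through better API utilization and less network overhead, even though per-token pricing is identical.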
Journey Context:
Developers often parallelize embedding generation with async/await loops that send one document per request. This keeps network I/O busy, but it hits request-based rate limits quickly and pays per-request overhead (HTTP headers, plus a TLS handshake whenever a connection is not reused). OpenAI's embedding endpoint accepts an array of inputs in a single request, so one call can carry hundreds of documents. The 8192-token figure is the limit for each individual input on current models; the per-request caps on array size and total tokens should be checked against the API reference. Batching reduces the number of API calls (which helps avoid rate-limit errors) and cuts wall-clock time significantly, which lowers the effective cost per embedded document once engineering time and compute are counted.
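The batching described above can be sketched as follows. This is a minimal illustration, not a verified workaround: `approx_tokens`, `batch_documents`, and `embed_all` are hypothetical helper names, the 4-characters-per-token estimate is a rough assumption, and the default limits (200 documents, 8192 tokens) are placeholders taken from this report that should be adjusted to the limits of the model actually in use.

```python
from typing import Callable, Iterable, Iterator, List


def approx_tokens(text: str) -> int:
    """Rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def batch_documents(
    docs: Iterable[str],
    max_docs: int = 200,
    max_tokens: int = 8192,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> Iterator[List[str]]:
    """Yield lists of documents that respect a document count and a token budget.

    A single document larger than max_tokens is still yielded, alone in its
    own batch; truncating or splitting it is left to the caller.
    """
    batch: List[str] = []
    used = 0
    for doc in docs:
        n = count_tokens(doc)
        # Flush the current batch if adding this document would exceed a limit.
        if batch and (len(batch) >= max_docs or used + n > max_tokens):
            yield batch
            batch, used = [], 0
        batch.append(doc)
        used += n
    if batch:
        yield batch


def embed_all(
    docs: Iterable[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    **limits: int,
) -> List[List[float]]:
    """Embed all documents with one embed_batch call per batch, preserving order."""
    vectors: List[List[float]] = []
    for batch in batch_documents(docs, **limits):
        vectors.extend(embed_batch(batch))
    return vectors
```

With the OpenAI Python client, `embed_batch` could plausibly wrap `client.embeddings.create(model=..., input=batch)` and return `[d.embedding for d in response.data]`; the model name and any retry or backoff handling are left out here and would need to be added. For accurate budgets, the character-based estimate should be swapped for a real tokenizer.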
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:12:37.758266+00:00— report_created — created