Report #93107
[cost\_intel] Calling embedding APIs with single documents in a loop instead of batching, causing roughly 10x higher latency and avoidable per-request overhead
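Batch multiple texts into each embedding request, staying within the provider's documented per-request limits \(OpenAI accepts an array of up to 2048 inputs, each limited to 8191 tokens on the text-embedding-3 models; 96 texts per call is Cohere's limit\); token cost on OpenAI is unchanged, so the gain is fewer round trips, higher throughput, and fewer requests counted against rate limits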
Journey Context:
Developers often implement embedding pipelines as 'for doc in docs: embed\(doc\)', issuing one HTTP request per document. Token costs are identical either way, but each request pays a network round trip \(50-200ms\), and the pattern ignores the API's support for multiple inputs per request. OpenAI's embeddings endpoint accepts an array of inputs per request \(up to 2048 per its API reference\), and the 8191-token limit for text-embedding-3-large applies to each input rather than to the request total; the 96-texts-per-call limit is Cohere's. Sending 100 single-sentence documents \(50 tokens each, 5,000 tokens total\) sequentially costs 100 round trips, or 5-20 seconds of network latency alone, while a single batched request typically completes in under a second. OpenAI doesn't discount batched tokens, so on OpenAI the saving is time, not money; batching only lowers the bill where a provider charges per-request fees, which this report attributes to Azure and some other providers. The primary win is throughput and staying under request-based rate limits \(reported here as 2000 req/min on tier 2\).
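The batching described above can be sketched as follows. This is a minimal illustration, not a provider-specific implementation: `make_batches`, `embed_all`, and `approx_tokens` are hypothetical helper names, the default caps are conservative placeholders to be replaced with the limits from your provider's documentation, and the four-characters-per-token estimate is a rough assumption that a real tokenizer should replace.

```python
from typing import Callable, Iterator, List, Sequence


def approx_tokens(text: str) -> int:
    """Rough token estimate. Assumption: about 4 characters per token."""
    return max(1, len(text) // 4)


def make_batches(
    texts: Sequence[str],
    max_items: int = 96,     # placeholder cap, not an authoritative limit
    max_tokens: int = 8000,  # placeholder cap, not an authoritative limit
    count_tokens: Callable[[str], int] = approx_tokens,
) -> Iterator[List[str]]:
    """Yield lists of texts that respect both an item cap and a token cap.

    A single text larger than max_tokens is still yielded, alone in its
    own batch; truncating or splitting it is left to the caller.
    """
    batch: List[str] = []
    used = 0
    for text in texts:
        n = count_tokens(text)
        # Flush the current batch if adding this text would exceed a cap.
        if batch and (len(batch) >= max_items or used + n > max_tokens):
            yield batch
            batch, used = [], 0
        batch.append(text)
        used += n
    if batch:
        yield batch


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    **limits,
) -> List[List[float]]:
    """Call embed_batch once per batch and return vectors in input order.

    embed_batch is supplied by the caller and must return one vector per
    input text, in the same order as the batch it received.
    """
    vectors: List[List[float]] = []
    for batch in make_batches(texts, **limits):
        vectors.extend(embed_batch(batch))
    return vectors
```

With the official OpenAI Python client, `embed_batch` could wrap `client.embeddings.create`, passing the batch as a list to `input` and reading the `embedding` field of each item in the response's `data`; that wiring is not exercised here and should be checked against the current client documentation.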
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:52:00.876232+00:00— report_created — created