Report #82338
[cost_intel] Sending individual text snippets to OpenAI's text-embedding-3-small API one at a time costs the same $0.02 per 1M tokens as batching, but forgoes a reported ~50% latency reduction and the throughput gains of batched requests
Batch multiple inputs per request, up to the provider's cap (2048 inputs for OpenAI, 96 for Cohere, 128 for Voyage; confirm against current docs). The price per token is unchanged (0% direct savings), but throughput increases 10-50x and per-request overhead latency drops
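A minimal sketch of the batching pattern, assuming a caller-supplied `embed_batch` function that wraps the provider's embedding call (takes a list of strings, returns one vector per string, in order). The names `chunked`, `embed_all`, and `embed_batch` are hypothetical, and `max_items=96` is only an illustrative default; set it to whatever cap the provider documents.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
    max_items: int = 96,  # illustrative default; use the provider's documented cap
) -> List[Vector]:
    """Embed `texts` with one call per batch instead of one call per text.

    `embed_batch` is a hypothetical wrapper around the provider client.
    Output order matches input order.
    """
    vectors: List[Vector] = []
    for batch in chunked(texts, max_items):
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise RuntimeError("provider returned a different number of vectors than inputs")
        vectors.extend(result)
    return vectors
```

The per-input token limit is not enforced here; texts that exceed it would need to be truncated or split before batching.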
Journey Context:
Embedding pricing is per token, so batching does not reduce direct token costs; it amortizes HTTP overhead across many inputs and lets the provider use its GPUs more efficiently. The real win is latency and throughput. Processing 100k documents sequentially takes 100k API calls (hours); batched at 96 per call, it takes about 1,042 calls (minutes). Critical constraints: OpenAI accepts up to 2048 inputs per request with at most 8192 tokens per input; Cohere allows 96 inputs per request and Voyage 128. Above roughly 1M embeddings/day, batching is effectively required to stay under requests-per-minute rate limits. Secondary benefit: some providers discount their asynchronous batch endpoints, which are separate from in-request batching; the 5-10% figure reported here for Azure is unverified, so check current pricing. Quality signature: none; the embeddings are identical, just produced faster.
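The call counts in the 100k-document example come from a ceiling division of the number of inputs by the batch size. A short sketch, with `request_count` as a hypothetical helper name:

```python
import math


def request_count(num_texts: int, batch_size: int) -> int:
    """Number of API requests needed to embed `num_texts` inputs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(num_texts / batch_size)


sequential = request_count(100_000, 1)       # 100000 requests
batched_96 = request_count(100_000, 96)      # 1042 requests
batched_2048 = request_count(100_000, 2048)  # 49 requests
```

Token spend is the same in all three cases; only the number of round trips changes.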
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:47:33.450127+00:00— report_created — created