Report #50589
[cost_intel] What is the optimal batch size for OpenAI text-embedding-3-small to minimize cost-per-million-tokens in high-volume pipelines?
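Batch documents into single requests of 100-500 documents (keeping each input within the 8,191-token limit) rather than making one API call per document. Batching inputs into a request does not change the per-token price; it removes per-request HTTP overhead and raises throughput. The 50% cost reduction comes from submitting the work through OpenAI's asynchronous Batch API, which is billed at half the standard rate. For 10M tokens/day at the standard rate of $0.02 per 1M tokens, that cuts the bill from about $0.20 to about $0.10 per day.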
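As a worked check of the arithmetic, the sketch below computes daily cost from token volume. The price of $0.02 per 1M tokens and the 50% Batch API discount are assumptions taken from OpenAI's published list pricing and should be verified against the current pricing page; `daily_cost` is a hypothetical helper name.

```python
# Assumed list price in USD per 1M tokens for text-embedding-3-small.
STANDARD_PRICE_PER_M = 0.02
# Assumed discount for the asynchronous Batch API (50%).
BATCH_API_DISCOUNT = 0.5


def daily_cost(tokens_per_day: int,
               price_per_million: float = STANDARD_PRICE_PER_M,
               discount: float = 0.0) -> float:
    """Return the daily embedding cost in USD for a given token volume."""
    return tokens_per_day / 1_000_000 * price_per_million * (1.0 - discount)


standard = daily_cost(10_000_000)
discounted = daily_cost(10_000_000, discount=BATCH_API_DISCOUNT)
print(f"standard: ${standard:.2f}/day, Batch API: ${discounted:.2f}/day")
```

Packing more documents into each synchronous request leaves `standard` unchanged; under these assumptions only the discount term changes the bill.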
Journey Context:
Engineers often send embedding requests synchronously, one document at a time, for simplicity, assuming the API cost is purely token-based. The billed cost is indeed token-based, but per-request latency and HTTP overhead can reduce effective pipeline throughput by 3-5x. The 8,191-token limit applies to each input rather than to the whole request, so a request can carry far more than the ~20 documents of 400 tokens each that a single 8k budget would hold; the binding limits are the cap on inputs per request (2,048) and the cap on total tokens per request. Failing to batch also consumes request-per-minute quota faster, triggering 429 rate-limit errors at high volume and forcing exponential backoff that degrades throughput further. The alternative of local embeddings (BGE-large) eliminates API cost but requires GPU infrastructure at $2-5/hour ($48-120/day), which at $0.02 per 1M tokens only pays off at roughly 2.4-6 billion tokens/day.
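The packing step described above can be sketched as a greedy batcher that closes a batch when either the input-count cap or a token budget would be exceeded. This is a minimal illustration, not a definitive implementation: `pack_batches` and `rough_token_count` are hypothetical helpers, the 4-characters-per-token estimate is a crude stand-in for a real tokenizer, and `MAX_TOKENS_PER_REQUEST` is an assumed budget to be set from the current API limits.

```python
from typing import Callable, Iterable, Iterator

MAX_INPUT_TOKENS = 8191           # per-input limit quoted in this report
MAX_INPUTS_PER_REQUEST = 500      # upper end of the recommended 100-500 range
MAX_TOKENS_PER_REQUEST = 250_000  # assumed budget; check the current API limits


def rough_token_count(text: str) -> int:
    """Crude estimate (~4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def pack_batches(
    docs: Iterable[str],
    count_tokens: Callable[[str], int] = rough_token_count,
    max_inputs: int = MAX_INPUTS_PER_REQUEST,
    max_tokens: int = MAX_TOKENS_PER_REQUEST,
) -> Iterator[list[str]]:
    """Greedily group documents into batches that respect both caps."""
    batch: list[str] = []
    used = 0
    for doc in docs:
        n = count_tokens(doc)
        if n > MAX_INPUT_TOKENS:
            raise ValueError("document exceeds the per-input token limit")
        # Close the current batch if adding this document would break a cap.
        if batch and (len(batch) >= max_inputs or used + n > max_tokens):
            yield batch
            batch, used = [], 0
        batch.append(doc)
        used += n
    if batch:
        yield batch


# 1,200 synthetic documents of ~400 estimated tokens each.
docs = ["x" * 1600] * 1200
print([len(b) for b in pack_batches(docs)])
```

Each yielded batch could then be passed as the `input` list of a single `client.embeddings.create(model="text-embedding-3-small", input=batch)` call in the OpenAI Python SDK, with retry and exponential backoff around the call for 429 responses.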
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:23:48.022614+00:00— report_created — created