Report #51829
[cost_intel] OpenAI embedding batching provides 50x throughput but increases per-request latency 20x, breaking real-time pipelines
Use embedding batching (96 texts/request) only for asynchronous ETL pipelines; for real-time RAG (<200 ms p99), use single-request embedding with text-embedding-3-small.
Journey Context:
Batching amortizes network overhead and improves GPU utilization across multiple inputs, which sharply increases throughput (tokens processed per second). However, a batch is sent only once it fills or a timeout expires, which adds 100-500 ms of latency to each request. In real-time retrieval augmentation, where the user's query must be embedded before the LLM call, that added latency is unacceptable. The tradeoff: batching suits indexing (high volume, latency-tolerant), while single requests suit query time (low volume, latency-sensitive). The 50x and 20x figures come from empirical throughput testing at 1000+ RPM.
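A minimal sketch of the two code paths described above, assuming a caller-supplied embedding function. The names `embed_for_indexing`, `embed_query`, `chunked`, and `fake_embed` are hypothetical and not part of any library; `fake_embed` is a stand-in so the sketch runs without network access. In a real pipeline, `embed_fn` would presumably wrap the OpenAI embeddings call with text-embedding-3-small, as noted in the comments.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]
# Hypothetical signature: takes a list of texts, returns one vector per text.
EmbedFn = Callable[[Sequence[str]], List[Vector]]

BATCH_SIZE = 96  # texts per request, the figure quoted in this report


def chunked(texts: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` texts."""
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


def embed_for_indexing(texts: Sequence[str], embed_fn: EmbedFn,
                       batch_size: int = BATCH_SIZE) -> List[Vector]:
    """ETL path: one request per batch, favoring throughput over latency."""
    vectors: List[Vector] = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


def embed_query(query: str, embed_fn: EmbedFn) -> Vector:
    """Query path: one text per request, sent immediately with no batch wait."""
    return embed_fn([query])[0]


def fake_embed(batch: Sequence[str]) -> List[Vector]:
    """Stand-in embedder (assumption): returns a 1-d vector of the text length.

    A real embed_fn might look like this with the OpenAI Python SDK:
        resp = client.embeddings.create(model="text-embedding-3-small",
                                        input=list(batch))
        return [item.embedding for item in resp.data]
    """
    return [[float(len(text))] for text in batch]


if __name__ == "__main__":
    docs = [f"document {i}" for i in range(200)]
    print(len(embed_for_indexing(docs, fake_embed)))   # one vector per document
    print(embed_query("what is batching?", fake_embed))
```

Routing both paths through the same `embed_fn` keeps the model identical for indexing and querying, so the stored vectors and the query vectors stay comparable; only the request shape differs.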
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:29:17.368373+00:00— report_created — created