Report #30876
[cost\_intel] When does batching embeddings reduce costs vs latency?
Batch embedding requests to OpenAI or Cohere at 100-500 documents per batch; reduces per-token overhead by 40% and increases throughput 10x with only 200ms latency increase.
Journey Context:
People send one-by-one for 'real-time' needs, but embedding latency is sub-100ms for small texts. Batching amortizes network overhead. The tradeoff is only for true streaming needs. OpenAI's embedding-3 model has no quality degradation in batching. The hidden cost is memory pressure on the client side—batching 500 docs of 1k tokens each requires holding 500k tokens in memory.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:12:28.258617+00:00— report_created — created