Report #55692
[cost_intel] OpenAI embedding API: batching vs. real-time requests (latency/cost tradeoff)
Batch embedding requests (about 100 texts per request) in bulk pipelines for a large throughput gain; do not batch on the client when the p99 latency requirement is under 500 ms, because waiting to fill a batch adds roughly 200-400 ms of queuing delay.
Journey Context:
Engineers often send embedding requests one at a time to minimize latency, but OpenAI's embeddings endpoint (e.g., with text-embedding-3-small) accepts an array of input texts in a single request and bills it at the same per-token rate. The API reference allows up to 2,048 inputs per request, so 100 texts per request is a conservative batch size rather than a hard cap. Batching amortizes per-request HTTP overhead and can raise throughput by an estimated 50-100x, which matters for indexing pipelines that process millions of documents. The cost is latency and complexity: the client must queue texts until a batch is full (or a timeout fires) before sending, and it must map each returned vector back to its input by index. For real-time, user-facing features (e.g., live semantic search as the user types), that queuing delay can break a p99 latency requirement of under 500 ms. Because the per-token price is identical either way, the deciding factor is volume: below about 100 embeddings/minute, individual real-time requests are fine; above about 1,000/minute, batching is needed to stay within rate limits and maximize throughput.
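The batching side of this tradeoff can be sketched as follows. This is a minimal illustration, not a verified workaround: `chunked`, `embed_in_batches`, and `openai_embed_fn` are hypothetical helper names introduced here, the batch size of 100 is the figure from this report, and the sketch assumes the official `openai` Python package (v1-style client) with `OPENAI_API_KEY` set in the environment.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[Vector]],
    batch_size: int = 100,  # figure from this report, not an API limit
) -> List[Vector]:
    """Embed `texts` in batches, returning vectors in input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    vectors: List[Vector] = []
    for batch in chunked(texts, batch_size):
        result = embed_fn(batch)
        # Guard against a backend that drops or duplicates items.
        if len(result) != len(batch):
            raise ValueError("embed_fn returned a different number of vectors")
        vectors.extend(result)
    return vectors


def openai_embed_fn(
    model: str = "text-embedding-3-small",
) -> Callable[[List[str]], List[Vector]]:
    """Build an embed_fn backed by the OpenAI embeddings endpoint (assumed setup)."""
    from openai import OpenAI  # imported lazily so the helpers above work offline

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def _embed(batch: List[str]) -> List[Vector]:
        response = client.embeddings.create(model=model, input=batch)
        # Each item carries the index of its input; sort to restore input order.
        ordered = sorted(response.data, key=lambda item: item.index)
        return [item.embedding for item in ordered]

    return _embed
```

For an indexing job, a call might look like `vectors = embed_in_batches(documents, openai_embed_fn())`. Injecting `embed_fn` keeps the batching logic testable without network access; a latency-sensitive path would skip this helper and send a single-text request directly.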
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:58:25.687385+00:00— report_created — created