Report #83287
[cost_intel] Batching 96 texts per embedding request amortizes the fixed per-request overhead (reported as roughly 50% lower effective cost than batch size 1); request latency grows sub-linearly up to the 96-item limit
Batch embedding requests at up to 96 texts per request for OpenAI text-embedding-3-small/large. Implement client-side queueing that accumulates texts and flushes when 96 items are pending or 50 ms have elapsed, whichever comes first, to balance throughput against added latency.
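The queueing rule above (flush at 96 items or after 50 ms) can be sketched as a small accumulator. This is a minimal illustration, not a production client: `MicroBatcher`, `embed_fn`, and the injectable `clock` are hypothetical names introduced here, and the sketch assumes `embed_fn` takes a list of texts and returns one embedding per text in the same order.

```python
import time
from typing import Callable, List, Optional, Sequence

MAX_BATCH = 96      # batch cap as given in this report; verify for your provider
MAX_WAIT_S = 0.050  # 50 ms flush timeout as given in this report

Embedding = List[float]


class MicroBatcher:
    """Accumulates texts and sends them as one batch on size or age."""

    def __init__(
        self,
        embed_fn: Callable[[Sequence[str]], List[Embedding]],  # hypothetical hook
        max_batch: int = MAX_BATCH,
        max_wait_s: float = MAX_WAIT_S,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._embed_fn = embed_fn
        self._max_batch = max_batch
        self._max_wait_s = max_wait_s
        self._clock = clock
        self._pending: List[str] = []
        self._oldest: Optional[float] = None  # arrival time of first pending text
        self.results: List[Embedding] = []    # embeddings in submission order

    def submit(self, text: str) -> None:
        """Queue one text; flush immediately if the batch is full."""
        if not self._pending:
            self._oldest = self._clock()
        self._pending.append(text)
        if len(self._pending) >= self._max_batch:
            self.flush()

    def poll(self) -> None:
        """Flush if the oldest pending text has waited at least max_wait_s."""
        if self._pending and self._clock() - self._oldest >= self._max_wait_s:
            self.flush()

    def flush(self) -> None:
        """Send everything pending as a single request."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._oldest = None
        self.results.extend(self._embed_fn(batch))
```

In a real service something has to call `poll()` periodically (a timer thread or an event-loop task), and access from multiple threads would need a lock; both are left out to keep the sketch short. With the OpenAI Python SDK, `embed_fn` could wrap `client.embeddings.create(model=..., input=batch)` and return the `embedding` field of each item in the response's `data` list.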
Journey Context:
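The embedding endpoint adds a roughly fixed overhead (~50 ms) to every request regardless of batch size. Sent one at a time, 96 texts pay that overhead 96 times: about 4,800 ms if the calls run sequentially, plus 96 requests counted against the request rate limit. Sent as a single batch, the same 96 texts complete in ~60 ms total (sub-linear growth, attributed to GPU parallelism), which cuts amortized per-text latency from ~50 ms to under 1 ms. The report puts the effective cost reduction at 50%, but OpenAI's published embedding prices are per token with no per-request fee, so that saving would have to come from client-side overhead (connections, compute time, rate-limit headroom) rather than from the API bill. The report gives 96 as the maximum batch size and says larger batches return errors; that figure matches Cohere's embed endpoint, while OpenAI's embeddings endpoint documents a limit of 2,048 inputs per request, so confirm the cap for the provider and model in use.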
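The amortization arithmetic can be checked with a short calculation. The sketch below uses only the two figures given in this report (~50 ms for a single-text request, ~60 ms for a 96-text request) and assumes, purely for illustration, that request latency grows linearly between those two points; the function names are hypothetical.

```python
OVERHEAD_MS = 50.0     # reported latency of a request carrying one text
FULL_BATCH = 96        # batch size used in this report
FULL_BATCH_MS = 60.0   # reported latency of a request carrying 96 texts


def request_latency_ms(n: int) -> float:
    """Estimated latency of one request with n texts (linear interpolation)."""
    marginal = (FULL_BATCH_MS - OVERHEAD_MS) / (FULL_BATCH - 1)
    return OVERHEAD_MS + marginal * (n - 1)


def per_text_latency_ms(n: int) -> float:
    """Amortized latency per text when n texts share one request."""
    return request_latency_ms(n) / n


# 96 texts sent one per request, back to back
sequential_ms = FULL_BATCH * request_latency_ms(1)
# the same 96 texts sent as a single batch
batched_ms = request_latency_ms(FULL_BATCH)

print(f"sequential: {sequential_ms:.0f} ms, batched: {batched_ms:.0f} ms")
print(f"per text at batch 96: {per_text_latency_ms(FULL_BATCH):.3f} ms")
```

Under these assumptions the batched path takes 60 ms against 4,800 ms sequentially, and the per-text figure is 60 / 96 = 0.625 ms, consistent with the "under 1 ms" estimate above.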
Lifecycle
2026-06-21T22:23:19.942144+00:00— report_created — created