Report #70173
[cost\_intel] Embedding API unbatched request overhead inflating costs 5x on small document streams
Batch embedding requests to at least 100 documents per API call for OpenAI text-embedding-3-small. Single-document requests pay roughly 5x the per-token list price, attributed to fixed per-request costs and TCP/TLS handshake overhead. For real-time singleton streams that cannot batch, switch to local sentence-transformers or to Cohere's API, which has lower per-request floors.
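A minimal sketch of the batching recommendation above, assuming the documents are already available as a list. The helper names (`chunked`, `embed_in_batches`, `openai_embed`) are hypothetical and chosen for illustration. `openai_embed` assumes the `openai` Python package (v1 or later) and an `OPENAI_API_KEY` in the environment. Any callable that maps a list of strings to a list of vectors can be passed in its place.

```python
from typing import Callable, Iterator, List, Sequence

Embedding = List[float]
EmbedFn = Callable[[List[str]], List[Embedding]]


def chunked(docs: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` documents."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(docs), size):
        yield list(docs[start:start + size])


def embed_in_batches(
    docs: Sequence[str], embed_fn: EmbedFn, batch_size: int = 100
) -> List[Embedding]:
    """Call `embed_fn` once per batch instead of once per document."""
    vectors: List[Embedding] = []
    for batch in chunked(docs, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


def openai_embed(batch: List[str]) -> List[Embedding]:
    """Embed a whole batch in one API call (assumption: openai>=1.0 installed)."""
    # Imported lazily so the helpers above work without the package.
    from openai import OpenAI

    client = OpenAI()  # a real service would create this once and reuse it
    response = client.embeddings.create(
        model="text-embedding-3-small", input=batch
    )
    return [item.embedding for item in response.data]
```

With this shape, 250 documents become three API calls (100, 100, 50) rather than 250, for example `embed_in_batches(docs, openai_embed)`.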
Journey Context:
OpenAI embedding pricing appears linear \($0.02/1M tokens for text-embedding-3-small\), but the effective cost includes a per-request floor. Processing 1M tokens as 10,000 individual 100-token requests costs significantly more than $0.02; measurement shows ~$0.10/1M tokens, about 5x the list price, due to request overhead. The fix is client-side buffering: accumulate documents until the batch reaches 100 or a latency SLA \(e.g., 500ms\) forces a flush. For micro-batches of fewer than 10 documents where latency is critical, use sentence-transformers \(all-MiniLM-L6-v2\) locally or Cohere embed-english-v3, which has better small-batch economics.
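The client-side buffering described above could look roughly like the following sketch. The class name `EmbeddingBuffer` and its parameters are hypothetical. The flush target is any callable that accepts a list of documents (for instance, a function that sends one batched embedding request). The sketch is single-threaded: it assumes the caller invokes `poll()` periodically to enforce the latency deadline, and it adds no locking.

```python
import time
from typing import Callable, List, Optional


class EmbeddingBuffer:
    """Accumulate documents; flush on batch size or on a latency deadline."""

    def __init__(
        self,
        flush_fn: Callable[[List[str]], None],
        max_docs: int = 100,       # size trigger from the report
        max_wait_s: float = 0.5,   # example 500ms latency SLA
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush_fn = flush_fn
        self._max_docs = max_docs
        self._max_wait_s = max_wait_s
        self._clock = clock
        self._pending: List[str] = []
        self._oldest: Optional[float] = None  # arrival time of first pending doc

    def add(self, doc: str) -> None:
        """Queue a document; flush immediately once the batch is full."""
        if not self._pending:
            self._oldest = self._clock()
        self._pending.append(doc)
        if len(self._pending) >= self._max_docs:
            self.flush()

    def poll(self) -> None:
        """Flush if the oldest pending document has waited past the deadline."""
        if self._pending and self._clock() - self._oldest >= self._max_wait_s:
            self.flush()

    def flush(self) -> None:
        """Send whatever is pending as a single batch."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._oldest = None
        self._flush_fn(batch)
```

The deadline is measured from the oldest pending document, so no document waits longer than `max_wait_s` plus the polling interval. Call `flush()` on shutdown so that a partial batch is not lost.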
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:22:06.642137+00:00 — report_created — created