Report #65698
[cost\_intel] Processing embedding requests one-by-one in high-volume RAG pipelines
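Batch embedding requests, for example 96 texts per request \(OpenAI's documented cap is an array of 2048 inputs per request; a 96-text cap matches Cohere's embed endpoint\). The per-token price is unchanged, but amortizing HTTP overhead and using fewer requests against rate limits can raise throughput severalfold \(the 10x figure is workload-dependent\). A 50% price reduction is available separately through OpenAI's asynchronous Batch API, not through request-level batching.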
Journey Context:
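OpenAI's pricing for embeddings is per-token, but at scale the practical bottlenecks are request overhead and rate limits. Batching 96 documents of 100 tokens each into one call, instead of sending 96 separate requests, means 1 HTTP roundtrip instead of 96 and counts as 1 request against requests-per-minute limits. The 9,600 tokens are still billed at the same rate and still count in full against tokens-per-minute limits, so the gain is throughput rather than a lower bill. Pitfall: if documents vary wildly in length, one long doc can push a batch over the per-request token cap. Hosted APIs bill the actual tokens of each input, so padding is not billed there, but self-hosted models pad every input to the longest sequence in the batch and waste compute on short docs. Solution: sort by length and batch similar sizes, with a token budget per batch.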
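The length-sorted batching described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `approx_tokens`, `make_batches`, `embed_all`, and the `embed_batch` callable are hypothetical names, the whitespace token count is a crude stand-in for a real tokenizer, and the `max_items` and `max_tokens` defaults are assumptions to replace with your provider's documented limits.

```python
from typing import Callable, List, Sequence


def approx_tokens(text: str) -> int:
    """Crude token estimate (whitespace split); swap in a real tokenizer."""
    return max(1, len(text.split()))


def make_batches(
    texts: Sequence[str],
    max_items: int = 96,     # assumed per-request item cap
    max_tokens: int = 8000,  # assumed per-request token budget
    count_tokens: Callable[[str], int] = approx_tokens,
) -> List[List[int]]:
    """Group indices of `texts` into batches of similar length.

    Indices are sorted by estimated token count, then packed greedily
    until either the item cap or the token budget would be exceeded.
    A single text larger than the budget gets a batch of its own; it is
    not truncated here.
    """
    order = sorted(range(len(texts)), key=lambda i: count_tokens(texts[i]))
    batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for i in order:
        n = count_tokens(texts[i])
        if current and (len(current) >= max_items or current_tokens + n > max_tokens):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(i)
        current_tokens += n
    if current:
        batches.append(current)
    return batches


def embed_all(texts, embed_batch, **batch_kwargs):
    """Embed `texts` batch by batch and return vectors in the original order.

    `embed_batch` is a caller-supplied function (hypothetical here) that takes
    a list of strings and returns one vector per string, in the same order.
    """
    results = [None] * len(texts)
    for batch in make_batches(texts, **batch_kwargs):
        vectors = embed_batch([texts[i] for i in batch])
        for i, vec in zip(batch, vectors):
            results[i] = vec
    return results
```

In practice `embed_batch` would wrap a single call to the provider's client with a list input. Because batches are built from sorted indices, `embed_all` writes each vector back to its original position so downstream code does not see the reordering.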
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:45:18.844106+00:00— report_created — created