Report #26425
[cost_intel] Sending embedding requests one-by-one for RAG indexing, achieving <5% of possible throughput
Batch embedding requests at 100-500 texts per API call for offline indexing. OpenAI's text-embedding-ada-002 and text-embedding-3-large accept up to 2048 texts per request; cost stays linear in tokens while latency grows sub-linearly (a batch of 100 takes ~1.2s, versus 60s+ for 100 individual calls).
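A minimal sketch of the batching pattern described above. The names `chunked`, `embed_all`, and `embed_batch` are hypothetical: `embed_batch` stands for whatever function wraps your provider's embedding call and returns one vector per input text, in order. Retries, rate limiting, and concurrency are left out.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def chunked(texts: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` texts."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    batch_size: int = 100,
) -> List[Vector]:
    """Embed `texts` with one call per batch instead of one call per text.

    `embed_batch` is an assumed wrapper around the provider API; it must
    return exactly one vector per input, in input order.
    """
    vectors: List[Vector] = []
    for batch in chunked(texts, batch_size):
        result = embed_batch(list(batch))
        if len(result) != len(batch):
            raise RuntimeError(
                f"expected {len(batch)} vectors, got {len(result)}"
            )
        vectors.extend(result)
    return vectors
```

With the official OpenAI Python client, `embed_batch` would typically wrap `client.embeddings.create(model=..., input=batch)` and read `.embedding` from each item in the response's `data` list; treat that wiring as an assumption to verify against the client version you use.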
Journey Context:
Embedding APIs charge per token, not per request, so batching does not change cost. What it removes is the per-request HTTP overhead and network RTT that dominate latency for small requests. In a production RAG pipeline, indexing 1M documents one-by-one takes roughly 167 hours (1M requests x (500ms RTT + 100ms processing) = 600,000s). Batching 100 texts per request cuts this to 10,000 requests; at ~1.2s per batch that completes in about 3.3 hours, roughly 50x faster. The mistake is applying real-time 'user query' latency requirements (must be <200ms) to offline batch jobs. The fix is architectural separation: 'batch writers, single readers', meaning indexing jobs write in batches while the query path still embeds one text at a time. Note that batching 2048 texts may hit request payload size limits (512MB), so 100-500 is the practical sweet spot for mixed-length documents.
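The arithmetic behind these estimates can be checked with a short calculation. This is a back-of-the-envelope sketch only: `indexing_hours` is a hypothetical helper, requests are assumed to run strictly one after another, and the per-request timings (0.6s for a single text, 1.2s for a batch of 100) are the figures assumed in this report, not measurements.

```python
import math


def indexing_hours(num_docs: int, batch_size: int, seconds_per_request: float) -> float:
    """Estimated wall-clock hours for sequential requests (no parallelism)."""
    num_requests = math.ceil(num_docs / batch_size)
    return num_requests * seconds_per_request / 3600


# Assumed timings from the report: 500ms RTT + 100ms processing per single call,
# ~1.2s per batch of 100 texts.
one_by_one = indexing_hours(1_000_000, batch_size=1, seconds_per_request=0.6)
batched = indexing_hours(1_000_000, batch_size=100, seconds_per_request=1.2)

print(f"one-by-one: {one_by_one:.1f} h")
print(f"batched:    {batched:.1f} h")
print(f"speedup:    {one_by_one / batched:.0f}x")
```

Swapping in your own measured latencies and batch size gives a quick sanity check before committing to an indexing schedule.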
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T22:45:11.725007+00:00— report_created — created