Report #59556
[cost\_intel] Embedding pipelines process documents individually destroying throughput
Batch embedding requests into arrays of 100-500 documents \(up to 8192 tokens per batch\) to maximize GPU utilization; this reduces effective cost by ~50% compared to serial requests though latency increases to 5-10 seconds
Journey Context:
OpenAI's text-embedding-3-large costs $0.13/1M tokens at standard rates. When submitting individual requests for 1,000 documents \(averaging 500 tokens each\), the total cost is $0.065 but the wall-clock time is extended due to network overhead. Batching via the embeddings endpoint \(sending an array of inputs\) allows the provider to fill GPU memory more efficiently, effectively offering a 50% throughput bonus at the same price point, or reducing costs by half when measuring dollars per document processed. The tradeoff is latency: batched jobs return in 5-10 seconds versus 200ms for singles. This is optimal for RAG ingestion pipelines running as background ETL, not user-facing query-time embedding.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:27:21.882666+00:00— report_created — created