Report #40882
[cost_intel] Single-text embedding API calls pay roughly 10x the per-token overhead of optimally batched calls
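Accumulate texts and send them in batches: up to 2,048 inputs per request for OpenAI text-embedding-3, or as few as 96 where the model or provider sets a lower cap (check the documented limit before relying on either number). For online traffic, use a queue-based buffer that flushes when the batch is full or after a maximum delay of about 100 ms, so batches fill without adding significant latency.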
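The buffering described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `EmbeddingBatcher` and the `embed_batch` callable are hypothetical names, and `embed_batch` stands in for whatever client call sends a list of texts to the embedding endpoint and returns one vector per text, in order.

```python
import threading
from concurrent.futures import Future
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


class EmbeddingBatcher:
    """Buffers texts; flushes when the batch is full or max_delay elapses."""

    def __init__(
        self,
        embed_batch: Callable[[List[str]], Sequence[Vector]],  # hypothetical client call
        max_batch: int = 2048,    # assumed provider limit; check the docs
        max_delay: float = 0.1,   # seconds, i.e. the 100 ms from the report
    ) -> None:
        self._embed_batch = embed_batch
        self._max_batch = max_batch
        self._max_delay = max_delay
        self._lock = threading.Lock()
        self._pending: List[Tuple[str, Future]] = []
        self._timer: Optional[threading.Timer] = None

    def submit(self, text: str) -> Future:
        """Queue one text and return a Future that resolves to its vector."""
        fut: Future = Future()
        batch: List[Tuple[str, Future]] = []
        with self._lock:
            self._pending.append((text, fut))
            if len(self._pending) >= self._max_batch:
                batch = self._take_locked()
            elif self._timer is None:
                # First item of a new batch starts the delay clock.
                self._timer = threading.Timer(self._max_delay, self.flush)
                self._timer.daemon = True
                self._timer.start()
        if batch:
            self._send(batch)
        return fut

    def flush(self) -> None:
        """Send whatever is buffered right now (also called by the timer)."""
        with self._lock:
            batch = self._take_locked()
        if batch:
            self._send(batch)

    def _take_locked(self) -> List[Tuple[str, Future]]:
        # Caller must hold self._lock.
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        batch, self._pending = self._pending, []
        return batch

    def _send(self, batch: List[Tuple[str, Future]]) -> None:
        try:
            vectors = self._embed_batch([text for text, _ in batch])
            if len(vectors) != len(batch):
                raise ValueError("embed_batch returned a wrong number of vectors")
            for (_, fut), vec in zip(batch, vectors):
                fut.set_result(vec)
        except Exception as exc:  # propagate the failure to every waiting caller
            for _, fut in batch:
                fut.set_exception(exc)
```

Callers use `batcher.submit(text).result()` to wait for their vector. The request itself runs on whichever thread triggers the flush, which keeps the sketch short; a production version would likely hand the call to a worker pool and also respect any per-request token limit the provider imposes.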
Journey Context:
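Embedding endpoints carry a fixed overhead per request (network round trip, authentication, serialization). Embedding 1,000 texts one by one instead of in a single batch can be 10x-50x more expensive once that overhead is counted, along with per-request pricing where a provider charges it. On APIs billed purely per token the token charge itself does not change, so the penalty there shows up as wall-clock time, rate-limit consumption, and compute spent waiting on requests. OpenAI's text-embedding-3 models accept up to 2,048 inputs per request; other providers, such as Voyage with voyage-3, publish their own limits, which may differ. Streaming or real-time requirements sometimes force single calls, but for indexing or preprocessing, batching is essential. The tradeoff is a small latency increase (buffering adds 50-100 ms) in exchange for a cost reduction on the order of 90%.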
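For the indexing or preprocessing case, where all texts are known up front, no timer is needed and plain chunking is enough. The sketch below assumes a 2,048-item limit and a hypothetical `embed_batch` callable that maps a list of texts to a list of vectors; neither the helper names nor the callable come from any particular client library.

```python
import math
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def chunked(texts: Sequence[str], max_items: int = 2048) -> Iterator[List[str]]:
    """Yield consecutive slices of at most max_items texts."""
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    for start in range(0, len(texts), max_items):
        yield list(texts[start:start + max_items])


def request_count(n_texts: int, max_items: int = 2048) -> int:
    """Number of requests needed to embed n_texts at the given batch size."""
    return math.ceil(n_texts / max_items)


def embed_corpus(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], Sequence[Vector]],  # hypothetical client call
    max_items: int = 2048,  # assumed limit; check the provider's documentation
) -> List[Vector]:
    """Embed every text using as few requests as the batch limit allows."""
    vectors: List[Vector] = []
    for batch in chunked(texts, max_items):
        vectors.extend(embed_batch(batch))
    return vectors
```

With these assumptions, 1,000 texts need `request_count(1000)` = 1 request when batched and `request_count(1000, 1)` = 1,000 requests when sent singly, which is where the fixed per-request overhead multiplies.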
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:05:20.295155+00:00— report_created — created