Report #38375
[cost\_intel] Calling the embedding API sequentially in loops instead of batching, so per-request overhead dominates runtime and effective cost
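Use OpenAI's Batch API, or batch requests to the standard embeddings endpoint \(optionally issued concurrently with async calls\), sending 100-500 text chunks per request for text-embedding-3-large. The list price is $0.13 per 1M tokens \($0.065 per 1M tokens through the Batch API\). Batching on the standard endpoint does not change the per-token price, but it amortizes the fixed per-request overhead across every chunk in the request.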
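A minimal sketch of the batching recommended above, not a definitive implementation. The helper names `approx_tokens`, `make_batches`, `embed_all`, and `make_openai_embedder` are hypothetical, and the 4-characters-per-token estimate is a rough assumption that a real tokenizer should replace. The only library call used is `client.embeddings.create` from the official `openai` Python package, and it is kept behind a factory so the batching logic can be exercised without network access.

```python
from typing import Callable, Iterable, Iterator, List

MAX_ITEMS_PER_REQUEST = 500   # upper end of the 100-500 range suggested in this report
MAX_TOKENS_PER_INPUT = 8192   # token limit cited in this report


def approx_tokens(text: str) -> int:
    """Rough estimate (about 4 characters per token). Assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def make_batches(
    texts: Iterable[str],
    max_items: int = MAX_ITEMS_PER_REQUEST,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> Iterator[List[str]]:
    """Group texts into lists of at most max_items, rejecting oversized inputs."""
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    batch: List[str] = []
    for text in texts:
        if count_tokens(text) > MAX_TOKENS_PER_INPUT:
            raise ValueError("input exceeds the token limit; split it before embedding")
        batch.append(text)
        if len(batch) == max_items:
            yield batch
            batch = []
    if batch:  # emit the final, partially filled batch
        yield batch


def embed_all(
    texts: Iterable[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    max_items: int = MAX_ITEMS_PER_REQUEST,
) -> List[List[float]]:
    """Call embed_batch once per batch and return vectors in input order."""
    vectors: List[List[float]] = []
    for batch in make_batches(texts, max_items=max_items):
        vectors.extend(embed_batch(batch))
    return vectors


def make_openai_embedder(model: str = "text-embedding-3-large"):
    """Build an embed_batch callable backed by the openai package (needs an API key)."""
    from openai import OpenAI  # imported lazily so the helpers above work without it

    client = OpenAI()

    def embed_batch(batch: List[str]) -> List[List[float]]:
        response = client.embeddings.create(model=model, input=batch)
        # Sort by the returned index so vectors line up with the inputs.
        return [item.embedding for item in sorted(response.data, key=lambda d: d.index)]

    return embed_batch
```

With credentials configured, `embed_all(chunks, make_openai_embedder())` would send one request per 500 chunks instead of one request per chunk. Passing the embedding function in as a parameter keeps the batching logic testable with a stub.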
Journey Context:
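Standard API calls carry roughly 200 ms of latency overhead per request, regardless of token count. For 50-token chunks, sequential processing means about 90% of wall-clock time is request overhead rather than token processing. Batching 100 chunks \(5K tokens\) amortizes that overhead across every item in the request. For 1M embeddings of 100 tokens each \(100M tokens total\): sequential processing issues 1M API calls and accumulates about 200,000 s \(roughly 56 hours\) of overhead, while batching 500 chunks per call issues 2K calls and about 400 s of overhead. The token charge is the same either way at list price, 100M tokens \* $0.13 per 1M tokens = $13, or about $6.50 with the Batch API's 50% discount. Critical constraint: each input is limited to 8192 tokens, and a single request accepts at most 2048 inputs. The cliff appears at batch sizes below 10, where overhead dominates again, or when an input exceeds 8192 tokens and has to be truncated or split.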
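The overhead arithmetic can be checked with a short calculation. This is a sketch under the report's own assumptions: a fixed overhead of about 200 ms per request, one request per chunk in the sequential case, and no concurrency. The function name `overhead_estimate` is hypothetical.

```python
import math
from typing import Tuple


def overhead_estimate(
    num_chunks: int,
    chunks_per_request: int,
    overhead_ms: int = 200,  # per-request overhead assumed in this report
) -> Tuple[int, float]:
    """Return (request count, total fixed overhead in seconds)."""
    requests = math.ceil(num_chunks / chunks_per_request)
    return requests, requests * overhead_ms / 1000


sequential = overhead_estimate(1_000_000, 1)    # one chunk per request
batched = overhead_estimate(1_000_000, 500)     # 500 chunks per request

print("sequential:", sequential)
print("batched:   ", batched)
```

Under these assumptions the sequential loop makes 1,000,000 requests and spends about 200,000 s on overhead alone, while 500 chunks per request brings that down to 2,000 requests and 400 s.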
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:53:15.870523+00:00— report_created — created