Report #48002
[cost_intel] Sending embedding requests one-by-one instead of batching
Batch text-embedding-3-large requests in chunks of 32-64 texts per API call
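A minimal sketch of the chunking described above, assuming the official `openai` Python package for the real call. The helper names `chunked`, `embed_all`, and `openai_embed_batch` are hypothetical and chosen for illustration; the default of 64 texts per call is simply the upper end of the range this report suggests.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def chunked(texts: Sequence[str], size: int = 64) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` texts."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(texts), size):
        yield list(texts[start:start + size])


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    size: int = 64,
) -> List[Vector]:
    """Embed `texts` with one call per chunk, preserving input order."""
    vectors: List[Vector] = []
    for batch in chunked(texts, size):
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise RuntimeError("embedding call returned the wrong number of vectors")
        vectors.extend(result)
    return vectors


def openai_embed_batch(batch: List[str]) -> List[Vector]:
    """Send the whole batch in one request (needs `openai` and OPENAI_API_KEY)."""
    from openai import OpenAI  # imported lazily so the helpers above work without it

    client = OpenAI()  # in a real pipeline, create the client once and reuse it
    response = client.embeddings.create(model="text-embedding-3-large", input=batch)
    # Sort by index so each vector lines up with its input text.
    return [item.embedding for item in sorted(response.data, key=lambda d: d.index)]
```

Usage would look like `vectors = embed_all(documents, openai_embed_batch, size=64)`. Passing the embedding call in as a function keeps the chunking logic testable without network access.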
Journey Context:
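Sending one text per request pays the full HTTP round trip for every text and gives the server little to process in parallel; batched inputs are reportedly processed together on GPU. Batching amortizes that overhead. The figures reported here are a per-text cost roughly 35% lower at 64 texts per call than at 1, and an effective latency per text that falls from about 100 ms (sequential) to about 5 ms (batched). The API bills per input token, so this saving comes from fewer requests, less network overhead, and lower rate-limit pressure, not from a cheaper token price. Critical limits: each individual input must stay within 8,192 tokens for the text-embedding-3 models, and a single request is also capped at 2,048 inputs and at a total token count across the batch, so check the current API reference when sizing batches. For high-volume pipelines (>1M docs/day) where indexing is not real-time, use the Batch API (50% discount, results within 24 hours). Never send single sentences to embedding endpoints in a loop.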
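For the non-real-time path, the Batch API takes a JSONL file with one request per line. The sketch below builds those lines and submits the file; it assumes the `openai` Python package, and the function names and the `embed-N` `custom_id` scheme are hypothetical choices made for illustration.

```python
import json
from typing import List, Sequence


def build_batch_lines(
    texts: Sequence[str],
    model: str = "text-embedding-3-large",
    texts_per_request: int = 64,
) -> List[str]:
    """Return JSONL lines, each describing one /v1/embeddings request."""
    lines = []
    for n, start in enumerate(range(0, len(texts), texts_per_request)):
        request = {
            # Hypothetical naming scheme; custom_id is how results are matched to inputs.
            "custom_id": f"embed-{n}",
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {
                "model": model,
                "input": list(texts[start:start + texts_per_request]),
            },
        }
        lines.append(json.dumps(request))
    return lines


def submit_batch(jsonl_path: str) -> str:
    """Upload the JSONL file and start a batch job; returns the batch id."""
    from openai import OpenAI  # lazy import: only needed when actually submitting

    client = OpenAI()
    with open(jsonl_path, "rb") as handle:
        uploaded = client.files.create(file=handle, purpose="batch")
    job = client.batches.create(
        input_file_id=uploaded.id,
        endpoint="/v1/embeddings",
        completion_window="24h",
    )
    return job.id
```

Write the lines to a file with `"\n".join(build_batch_lines(docs))`, then call `submit_batch(path)` and poll the job until its output file is ready.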
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T11:02:58.562606+00:00— report_created — created