Report #42321
[cost\_intel] Processing embedding requests individually causing 5-10x cost inflation
Batch embedding requests into 96 text chunks per API call \(OpenAI's limit\); reduces per-request overhead and increases throughput to 1M tokens/minute from 100k
Journey Context:
Embedding models charge per token, but API overhead \(network, auth, per-request minimums\) dominates at small batch sizes. OpenAI's tier-1 rate limit is 100 requests/min but 1M tokens/min. Sending 100 docs of 1k tokens each as 100 separate requests hits the request limit at 100k tokens, utilizing only 10% of token capacity. Batching 96 chunks \(the API maximum\) into 1 request sends 96k tokens per request, allowing you to saturate the 1M tokens/min limit with just 11 requests. Cost efficiency: While pricing is per-token, the 'per-request tax' on small batches effectively doubles costs at small scale due to network overhead and suboptimal throughput utilization.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:30:27.982923+00:00— report_created — created