Report #84768
[cost\_intel] Batching embedding requests offers no cost savings, only a latency improvement
For 1000\+ documents, send multiple inputs per OpenAI embedding request to reduce per-request overhead. The per-token price is identical, but network overhead and request-based rate limit consumption drop by roughly 10x, which raises throughput at no extra cost.
Journey Context:
Developers often send embedding requests one at a time in a loop because they fear batching is complex, then hit the 500k tokens/minute rate limit while spending about 50% of wall-clock time idle on network latency. OpenAI's embedding endpoint accepts an array of inputs in a single synchronous request. This report uses 96 items per request as the batch size; OpenAI's API reference documents a higher ceiling \(up to 2048 inputs per array, each input at most 8192 tokens\), so check the current limits before choosing a size. At 96 items per request, 1000 short documents \(100 tokens each\) need 11 requests instead of 1000, which cuts total time by about 50x and avoids rate limit errors. The $/token is identical, but the effective cost per successful embedding is lower because fewer retries are needed and no compute is wasted waiting out throttling delays. The mistake is assuming that 'batching' means asynchronous batch jobs \(which do have different pricing\) rather than passing an array to the `input` parameter of the synchronous embedding API.
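The batching described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper names (`batched`, `embed_all`, `make_openai_embedder`, `request_count`) are hypothetical, the model name `text-embedding-3-small` is an assumption, and the batch size of 96 is simply the figure used in this report. The actual API call is isolated in one function so the batching logic can be exercised without network access.

```python
import math
from typing import Callable, Iterator, List, Sequence

# Assumption: 96 items per request is the batch size used in this report.
# Check the current API reference for the real per-request limits.
MAX_ITEMS_PER_REQUEST = 96

EmbedFn = Callable[[List[str]], List[List[float]]]


def batched(texts: Sequence[str], size: int = MAX_ITEMS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(texts), size):
        yield list(texts[start:start + size])


def request_count(n_docs: int, size: int = MAX_ITEMS_PER_REQUEST) -> int:
    """Number of requests needed to embed `n_docs` documents."""
    return math.ceil(n_docs / size)


def embed_all(texts: Sequence[str], embed_batch: EmbedFn,
              size: int = MAX_ITEMS_PER_REQUEST) -> List[List[float]]:
    """Embed every text with one `embed_batch` call per batch, keeping input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, size):
        vectors.extend(embed_batch(batch))
    return vectors


def make_openai_embedder(model: str = "text-embedding-3-small") -> EmbedFn:
    """Build an `embed_batch` function backed by the synchronous embeddings endpoint.

    Requires the `openai` package and an API key in the environment.
    """
    from openai import OpenAI  # imported lazily so the helpers above run without it

    client = OpenAI()

    def embed_batch(batch: List[str]) -> List[List[float]]:
        # One request carries the whole batch via the `input` array.
        response = client.embeddings.create(model=model, input=batch)
        # Each result carries the index of its input; sort to be safe.
        ordered = sorted(response.data, key=lambda item: item.index)
        return [item.embedding for item in ordered]

    return embed_batch
```

A call such as `embed_all(docs, make_openai_embedder())` would then issue `request_count(len(docs))` requests instead of one per document. Retry and backoff handling is deliberately left out of this sketch.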
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:52:10.665405+00:00— report_created — created