Report #64463
[cost\_intel] Why does naive batching of embedding requests for OpenAI's text-embedding-3-large often result in only 20% cost savings vs theoretical maximum?
Pre-align text chunks to roughly 512-token boundaries \(assumed here to be the model's processing stride\) and bucket texts by length \(e.g., 0-128, 129-256 tokens\) before batching. Never batch texts whose token counts differ by more than 2x \(e.g., mixing 50-token and 500-token texts\), since the length mismatch causes padding waste in the GPU kernel.
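A minimal sketch of the bucketing step, assuming a token count is already available for each text, for instance from a tokenizer such as tiktoken. The name `bucket_by_length`, the bucket edges beyond 256, and the batch size of 64 are illustrative assumptions, not part of any OpenAI API.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical upper edges, extending the 0-128 / 129-256 example.
BUCKET_EDGES = [128, 256, 512, 1024, 2048, 4096, 8192]


def bucket_by_length(items, edges=BUCKET_EDGES, max_batch=64):
    """Group (text, token_count) pairs into batches of similar length.

    Yields lists of texts; each batch draws from a single bucket.
    """
    buckets = defaultdict(list)
    for text, n_tokens in items:
        # bisect_left maps 128 to bucket 0 and 129 to bucket 1.
        buckets[bisect_left(edges, n_tokens)].append(text)
    for idx in sorted(buckets):
        texts = buckets[idx]
        # Split each bucket into chunks of at most max_batch texts.
        for start in range(0, len(texts), max_batch):
            yield texts[start:start + max_batch]
```

With doubling edges, every bucket above the first keeps its lengths within a 2x ratio. The lowest bucket \(0-128\) does not, so very short texts may need finer edges.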
Journey Context:
Embedding models process a batch most efficiently when all of its sequences have the same length, or at least fall into the same 'bucket' of similar lengths. When a 50-token sentence is batched with an 800-token paragraph, the GPU kernel pads the shorter sequence to match the longest one in the batch \(or to the model's maximum stride\). OpenAI prices embeddings per token, so you do not pay for padding tokens directly; the cost shows up in throughput, because batch processing time is set by the padded length. At high volume \(>1M tokens/minute\), this appears as increased queue latency, forcing you either to provision more parallel workers or to accept slower processing. The 'theoretical maximum' assumes perfect packing, so savings of only 20% indicate severe length variance within batches. A common mistake is sending documents of wildly different lengths \(titles and body paragraphs, for example\) in the same batch. The fix is to sort the corpus by token length and batch only within narrow windows \(e.g., all 450-550 token chunks together\), in line with the model's assumed ~512-token processing blocks.
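The padding arithmetic and the sort-then-window fix can be sketched as follows. This models padding exactly as described above, with every sequence padded to the longest in its batch; whether OpenAI's servers behave this way is an assumption of this report. The names `padding_waste` and `windowed_batches` and their default parameters are hypothetical.

```python
def padding_waste(lengths):
    """Fraction of padded slots when a batch is padded to its longest sequence."""
    if not lengths:
        return 0.0
    padded_slots = max(lengths) * len(lengths)
    return 1.0 - sum(lengths) / padded_slots


def windowed_batches(lengths, max_ratio=2.0, max_batch=64):
    """Sort token counts, then start a new batch whenever the next item
    exceeds max_ratio times the shortest item in the current batch,
    or the current batch is full."""
    batches, current = [], []
    for n in sorted(lengths):
        if current and (n > max_ratio * current[0] or len(current) >= max_batch):
            batches.append(current)
            current = []
        current.append(n)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    mixed = [50, 800, 60, 750]
    print(f"one mixed batch: {padding_waste(mixed):.1%} padding")
    for batch in windowed_batches(mixed):
        print(batch, f"{padding_waste(batch):.1%} padding")
```

For the example in the text, a 50-token and an 800-token sequence padded to 800 occupy 1600 slots for 850 real tokens, so about 47% of the batch is padding. After sorting and windowing, the pairs \[50, 60\] and \[750, 800\] waste roughly 8% and 3% respectively.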
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:41:12.165896+00:00— report_created — created