Report #80423
[cost\_intel] Embedding batching achieves only 10% savings vs expected 50% due to padding waste
Pre-sort documents by token length and batch them into homogeneous groups \(e.g., 0-100 and 101-200 tokens\) to minimize padding. Use the provider's documented maximum batch size \(e.g., 96 is cited for text-embedding-3; confirm the limit in the provider's documentation\), and pad to the longest sequence within each batch rather than to a global maximum. For variable-length streams that cannot be buffered and sorted, disable dynamic batching and send single-request synchronous calls so that short sequences do not pay the padding tax.
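A minimal sketch of the pre-sort step, assuming documents are plain strings held in memory. The names `approx_token_count` and `length_sorted_batches` are hypothetical, and the whitespace word count is only a stand-in for the provider's real tokenizer. The batch size is a parameter to be set from the provider's documented limit.

```python
from typing import Callable, Iterator, List, Sequence


def approx_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def length_sorted_batches(
    docs: Sequence[str],
    max_batch_size: int,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> Iterator[List[int]]:
    """Yield batches of document indices so that lengths within a batch are similar."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    # Sort indices rather than texts so embeddings can be mapped back to input order.
    order = sorted(range(len(docs)), key=lambda i: count_tokens(docs[i]))
    for start in range(0, len(order), max_batch_size):
        yield order[start:start + max_batch_size]


# Three short documents and two long ones, deliberately interleaved.
docs = ["a b c", "a " * 200, "a b", "a " * 180, "a b c d"]
batches = list(length_sorted_batches(docs, max_batch_size=3))
print(batches)  # [[2, 0, 4], [3, 1]]
```

Each batch holds indices into `docs`, so the short documents end up together and the two long ones share a batch. Returning indices instead of texts is a design choice that makes it easy to restore the original order after the embedding calls return.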
Journey Context:
Embedding models process sequences in fixed-size tensors, so every sequence in a batch is padded to the length of the longest one. With naive batching \(random length distribution\), effective utilization of the token budget is only 50-60%: a 10-token document batched with a 500-token document is processed as 500 tokens, the same as its neighbor. The gap between the observed 10% savings and the expected 50% arises because providers charge for compute \(padded length\), not for semantic content. Homogeneous batching \(sorting by length first\) keeps the lengths within each batch similar, reducing padding overhead from 400% to <10%. It requires client-side buffering and sorting, but recovers the theoretical 50% cost reduction.
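The padding arithmetic above can be checked with a small calculation. This is a sketch under the assumption stated in the report, that each batch is billed at its longest sequence length times the number of sequences; `padded_cost` and `utilization` are hypothetical helper names, and the length lists are made-up illustrations rather than measured data.

```python
from typing import Sequence


def padded_cost(lengths: Sequence[int], batch_size: int) -> int:
    """Total tokens processed when each batch is padded to its own longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


def utilization(lengths: Sequence[int], batch_size: int) -> float:
    """Fraction of processed tokens that are real content rather than padding."""
    cost = padded_cost(lengths, batch_size)
    return sum(lengths) / cost if cost else 1.0


# The 10-token and 500-token pair from the text: both are padded to 500.
pair = [10, 500]
print(padded_cost(pair, batch_size=2))            # 1000
print(round(utilization(pair, batch_size=2), 2))  # 0.51

# Illustrative mixed workload, batched in arrival order and then sorted by length.
mixed = [10, 500, 20, 480, 15, 510, 12, 490]
print(padded_cost(mixed, batch_size=2))           # 3960
print(padded_cost(sorted(mixed), batch_size=2))   # 2064
```

In this made-up workload the real content is 2037 tokens, so arrival-order batching processes almost twice that, while length-sorted batching processes only about 1% more.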
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:35:49.445841+00:00— report_created — created