Report #95422
[cost_intel] Embedding batches with mixed text lengths are padded to the longest input, burning tokens on short texts
Pre-sort texts by token length and batch them into length-homogeneous groups (all <100 tokens, 100-500, etc.), or use dynamic batching with length-aware queuing.
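A minimal sketch of the length-bucketing idea, not a definitive implementation. The bucket edges, `max_batch_size`, and the whitespace-based `approx_token_count` are all illustrative assumptions; in practice you would swap in the tokenizer that matches your embedding model and send each returned batch to your embedding client yourself.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical bucket edges (exclusive upper bounds, in tokens); tune for your data.
BUCKET_EDGES: Sequence[int] = (100, 500, 2000, 8192)


def approx_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def bucket_by_length(
    texts: Sequence[str],
    count_tokens: Callable[[str], int] = approx_token_count,
    edges: Sequence[int] = BUCKET_EDGES,
    max_batch_size: int = 100,
) -> List[List[str]]:
    """Group texts into batches whose members fall in the same length bucket."""
    buckets: Dict[int, List[str]] = {edge: [] for edge in edges}
    outliers: List[str] = []

    for text in texts:
        n = count_tokens(text)
        for edge in edges:
            if n < edge:
                buckets[edge].append(text)
                break
        else:
            # Longer than every bucket edge: handle as an individual call.
            outliers.append(text)

    batches: List[List[str]] = []
    for edge in edges:
        group = buckets[edge]
        # Split each bucket into chunks of at most max_batch_size texts.
        for start in range(0, len(group), max_batch_size):
            batches.append(group[start:start + max_batch_size])

    # Each outlier becomes its own single-item batch.
    batches.extend([text] for text in outliers)
    return batches
```

Each returned batch contains only texts from one length bucket, so a single long document can no longer set the padded length for a batch of short queries.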
Journey Context:
Embedding APIs (OpenAI, Cohere) process batches by padding all inputs to the length of the longest input in the batch. If you batch one 8k-token document with ninety-nine 50-token queries, all 100 inputs are padded to 8k tokens. You pay for 800k tokens instead of 8,000 + 4,950 = 12,950 tokens, roughly a 62x overcharge. This is invisible in the API response because billing counts the padded tokens internally. The signature is an erratic cost per document in batch embedding jobs. The fix is length-homogeneous batching, or individual calls for outliers.
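The arithmetic above can be checked with a short sketch. It assumes the padded-billing model described in this report (every input in a batch is charged at the length of the longest input); `padded_cost` and `actual_cost` are hypothetical helper names, not part of any provider SDK.

```python
from typing import Sequence


def padded_cost(lengths: Sequence[int]) -> int:
    """Tokens charged if every input is padded to the longest one in the batch."""
    return max(lengths) * len(lengths) if lengths else 0


def actual_cost(lengths: Sequence[int]) -> int:
    """Tokens actually present in the inputs."""
    return sum(lengths)


# One 8k-token document batched with ninety-nine 50-token queries.
mixed_batch = [8000] + [50] * 99

mixed_padded = padded_cost(mixed_batch)   # 800000
mixed_actual = actual_cost(mixed_batch)   # 12950
overcharge = mixed_padded / mixed_actual  # about 61.8

# Same inputs split into two length-homogeneous batches.
split_padded = padded_cost([8000]) + padded_cost([50] * 99)  # 12950
```

Under this assumption, sending the long document on its own brings the padded total back down to the actual token count.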
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:44:34.246561+00:00— report_created — created