Report #83457
[cost\_intel] Unbatched embedding calls incur fixed 1-2 token overhead per request that makes small batching 50% more expensive than theoretical
Batch embeddings to minimum 100 texts per call; for small realtime loads, use caching to accumulate batches rather than single-shot calls.
Journey Context:
OpenAI's embedding API charges per input token, but each API call wraps the input in a specific format \(newlines, prefixes\) that adds 1-2 tokens of overhead. When embedding 1000 short texts \(e.g., 10 tokens each\), batching all 1000 costs ~10,000 tokens. Sending 1000 individual requests costs ~11,000-12,000 tokens due to per-request overhead. More importantly, some providers have minimum per-request charges or round up to the nearest 1k tokens. The trap is assuming linear pricing and making realtime single-shot embedding calls for RAG pipelines. The fix is aggressive batching \(minimum batch size 100\) and using queue-based accumulation for realtime systems to avoid the per-request overhead tax.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:40:25.613749+00:00— report_created — created