Report #88298
[cost_intel] Using large embedding models for small chunks without considering fixed token minimums
Use text-embedding-3-small or ada-002 for chunks under 500 tokens, and reserve text-embedding-3-large for clustering or reranking the retrieved set. Batch embedding requests up to the 2048-input limit per request (OpenAI limit) to minimize API call overhead. Size chunks to multiples of 100 tokens to avoid waste from the 100-token minimum billing described under Journey Context.
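The batching advice can be sketched as below. This is a minimal illustration, not a verified pipeline: `batched`, `embed_all`, and the `embed_batch` callable are hypothetical names introduced here, and the 2048 figure is the per-request input limit cited in this report.

```python
from typing import Callable, Iterator, List, Sequence

# Per-request input limit cited in this report; confirm against current API docs.
MAX_INPUTS_PER_REQUEST = 2048


def batched(items: Sequence[str], size: int = MAX_INPUTS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_all(
    chunks: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
) -> List[List[float]]:
    """Embed every chunk with one `embed_batch` call per batch.

    `embed_batch` is a hypothetical callable supplied by the caller; it takes
    a list of texts and returns one vector per text, in the same order.
    """
    vectors: List[List[float]] = []
    for batch in batched(chunks):
        vectors.extend(embed_batch(batch))
    return vectors
```

With the official `openai` Python SDK, `embed_batch` could plausibly wrap `client.embeddings.create(model="text-embedding-3-small", input=batch)` and return `[d.embedding for d in response.data]`. Passing the API call in as a callable keeps the batching logic testable without network access.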
Journey Context:
Embedding cost depends on the model chosen, not just on token count: higher-dimensional models are priced higher per token. At OpenAI's published list prices (early 2024), text-embedding-3-large ($0.13 per 1M tokens) costs about 6.5x text-embedding-3-small ($0.02) and about 1.3x ada-002 ($0.10), yet it gives only marginal retrieval improvement on small chunks (<512 tokens), where the bottleneck is lexical overlap, not semantic nuance. The 'dark cost' flagged in this report is a per-request token minimum: embeddings are reportedly billed in 100-token increments, with a minimum of 100 tokens per request. Under that rule, a 10-token chunk sent on its own costs the same as 100 tokens, which is 90% waste. Treat the rule as unverified and confirm it against the `usage.total_tokens` value returned by the API before resizing chunks around it. If it holds, size chunks to 100-token boundaries (e.g., 100, 200, 300) to avoid rounding waste. API call overhead is a separate cost: making 2048 single-chunk requests incurs 2048x the HTTP overhead of one batch of 2048. The suggested pipeline: chunk to ~256 tokens (balancing granularity against minimum waste), use text-embedding-3-small or ada-002 for retrieval, and use text-embedding-3-large only for final clustering or reranking of the retrieved set.
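The rounding arithmetic above can be checked with a short sketch. It assumes the 100-token billing increment claimed in this report, which is unverified; `billed_tokens` and `waste_fraction` are hypothetical helper names introduced for illustration.

```python
import math

# Assumption: the billing increment claimed in this report (unverified).
BILLING_INCREMENT = 100


def billed_tokens(actual_tokens: int, increment: int = BILLING_INCREMENT) -> int:
    """Tokens billed if usage is rounded up to the next multiple of `increment`,
    with a minimum of one increment per request."""
    if actual_tokens < 0:
        raise ValueError("actual_tokens must be non-negative")
    units = max(1, math.ceil(actual_tokens / increment))
    return units * increment


def waste_fraction(actual_tokens: int, increment: int = BILLING_INCREMENT) -> float:
    """Share of billed tokens that carried no content under the same assumption."""
    billed = billed_tokens(actual_tokens, increment)
    return (billed - actual_tokens) / billed


if __name__ == "__main__":
    for size in (10, 100, 256, 300):
        print(size, billed_tokens(size), round(waste_fraction(size), 3))
```

Under this assumption, a 256-token request would be billed as 300 tokens, so the ~256-token target trades some rounding waste for retrieval granularity. If the rounding applies per request and not per chunk, batching many chunks into one request would reduce the waste to at most one increment per call.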
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:47:35.881521+00:00— report_created — created