Report #43048
[cost_intel] OpenAI embedding API costs 3x higher than expected with small chunks
Enforce a minimum chunk size of 300 tokens; batch multiple small documents into a single API call, up to the 8191-token limit; use text-embedding-3-small for chunks under 500 tokens (roughly 1/6th the cost of large, at list prices of $0.02 vs. $0.13 per 1M tokens); store vector IDs to avoid re-embedding unchanged content.
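One way to avoid re-embedding unchanged content is to key stored vectors by a hash of the text and model name. The sketch below is illustrative only: `content_key`, `embed_with_cache`, and the `embed_batch` callable are hypothetical names, the cache is a plain dict standing in for whatever vector store is in use, and `fake_embed_batch` stands in for a real embedding call.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


def content_key(text: str, model: str) -> str:
    """Stable key that changes whenever the text or the embedding model changes."""
    return hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()


def embed_with_cache(
    texts: List[str],
    model: str,
    cache: Dict[str, Vector],
    embed_batch: Callable[[List[str]], List[Vector]],
) -> List[Vector]:
    """Return one vector per text, calling embed_batch only for unseen content."""
    keys = [content_key(text, model) for text in texts]

    # Collect texts that are not cached yet, dropping duplicates but keeping order.
    missing: Dict[str, str] = {}
    for key, text in zip(keys, texts):
        if key not in cache and key not in missing:
            missing[key] = text

    if missing:
        # One batched call for everything new (hypothetical embedder).
        vectors = embed_batch(list(missing.values()))
        for key, vector in zip(missing.keys(), vectors):
            cache[key] = vector

    return [cache[key] for key in keys]


# Demo with a fake embedder that records what it was asked to embed.
calls: List[List[str]] = []


def fake_embed_batch(batch: List[str]) -> List[Vector]:
    calls.append(list(batch))
    return [[float(len(text))] for text in batch]


cache: Dict[str, Vector] = {}
first = embed_with_cache(["alpha", "beta"], "text-embedding-3-small", cache, fake_embed_batch)
second = embed_with_cache(["alpha", "gamma!"], "text-embedding-3-small", cache, fake_embed_batch)
print(calls)  # the second call only contains the new text
```

Including the model name in the key is a design choice: switching models invalidates the cache instead of silently mixing vectors from different embedding spaces.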
Journey Context:
OpenAI's embedding models charge per input token, with text-embedding-3-large at $0.13 per 1M tokens. Naive RAG implementations chunk documents into 100-200 token pieces to "improve precision," then embed each piece in its own request. However, the API overhead (HTTP request, processing) is paid per call, and small chunks underutilize the 8191-token limit. Worse, retrieved chunks carry metadata overhead (source IDs, timestamps) that bloats the prompt, so a 200-token chunk effectively consumes 400-500 tokens in the generation phase. The math: 10,000 chunks of 100 tokens = 1M tokens embedded (about $0.13 with text-embedding-3-large); batched into groups of 81 (8191/100, rounded down), that takes only 124 API calls instead of 10,000, cutting request overhead and wall-clock time substantially. The fix is the chunking strategy: a minimum of 300-500 tokens per chunk unless smaller chunks are semantically necessary, and aggressive batching of small texts into single embedding calls up to the 8k limit.
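The batching arithmetic above can be checked with a small planner. This is a minimal sketch under stated assumptions: `plan_batches` and `embedding_cost_usd` are hypothetical helpers, token counts are assumed to be computed elsewhere by a tokenizer, and the 8191-token budget and $0.13 price are the figures quoted in this report, which should be checked against current API limits and pricing.

```python
from typing import List

MAX_TOKENS_PER_CALL = 8191      # budget quoted in this report (assumption)
PRICE_PER_MILLION_USD = 0.13    # text-embedding-3-large price quoted in this report


def plan_batches(token_counts: List[int], budget: int = MAX_TOKENS_PER_CALL) -> List[List[int]]:
    """Greedily group chunk indices so each group's token total stays within budget."""
    batches: List[List[int]] = []
    current: List[int] = []
    used = 0
    for index, count in enumerate(token_counts):
        if count > budget:
            raise ValueError(f"chunk {index} has {count} tokens, over the {budget} budget")
        # Start a new batch when this chunk would push the current one over budget.
        if current and used + count > budget:
            batches.append(current)
            current, used = [], 0
        current.append(index)
        used += count
    if current:
        batches.append(current)
    return batches


def embedding_cost_usd(token_counts: List[int], price: float = PRICE_PER_MILLION_USD) -> float:
    """Token charge only; it is the same whether or not the chunks are batched."""
    return sum(token_counts) * price / 1_000_000


counts = [100] * 10_000                 # 10,000 chunks of 100 tokens each
batches = plan_batches(counts)
print(len(batches))                     # 124 calls instead of 10,000
print(len(batches[0]), len(batches[-1]))  # 81 chunks per full batch, 37 in the last
```

Each planned batch would then be sent as one list-valued request, so the saving shows up as fewer round trips rather than fewer billed tokens.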
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:43:46.118436+00:00— report_created — created