Report #85721
[cost\_intel] When does pre-computing embeddings for RAG documents reduce costs vs on-the-fly embedding?
Pre-compute and cache embeddings in a vector DB for any document accessed more than 3 times. At $0.10 per 1M tokens, re-embedding a document on each of N accesses costs N times as much as embedding it once, so caching cuts embedding spend by 1 - 1/N: about 90% at 10 accesses and 99% at 100.
Journey Context:
Teams building RAG systems usually embed user queries on the fly \(necessary\), but many also re-embed the same source documents repeatedly across sessions. If a 10k-token document is queried 100 times, on-the-fly embedding costs $0.001 per pass \(10k/1M \* $0.10\), or $0.10 in total. Pre-computing costs $0.001 once, and the vector is then stored in Pinecone/Weaviate for low-latency retrieval. On embedding cost alone, pre-computing breaks even on the first use and saves money from the second use onward, so pre-compute every document in your corpus. The '3 times' rule adds a margin for storage overhead. This seems obvious, but many dynamic RAG systems still embed at query time for 'freshness' even when the document hasn't changed.
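The arithmetic above, and the "don't re-embed unchanged documents" idea, can be sketched as follows. This is a minimal illustration, not a production design: it assumes the $0.10 per 1M token rate quoted in this report, and `EmbeddingCache`, `embed_fn`, and the in-memory dict are hypothetical stand-ins for a real embedding client and a vector DB such as Pinecone or Weaviate. Storage cost is left as a parameter because the report does not give a figure for it.

```python
import hashlib
from typing import Callable, Dict, List

PRICE_PER_MILLION_TOKENS = 0.10  # USD; the rate quoted in this report


def embedding_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_TOKENS) -> float:
    """Cost in USD of embedding `tokens` tokens once."""
    return tokens / 1_000_000 * price_per_million


def on_the_fly_cost(doc_tokens: int, accesses: int) -> float:
    """Cost when the document is re-embedded on every access."""
    return embedding_cost(doc_tokens) * accesses


def precomputed_cost(doc_tokens: int, storage_cost: float = 0.0) -> float:
    """Cost when the document is embedded once; storage_cost is an assumed input."""
    return embedding_cost(doc_tokens) + storage_cost


def savings_fraction(accesses: int) -> float:
    """Share of embedding spend avoided by caching, ignoring storage (accesses >= 1)."""
    return 1 - 1 / accesses


class EmbeddingCache:
    """Hypothetical cache keyed by a hash of the document text.

    A changed document produces a new hash and is re-embedded, so
    'freshness' is preserved without re-embedding unchanged content.
    """

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn                   # stand-in for a real embedding API call
        self._store: Dict[str, List[float]] = {}    # stand-in for a vector DB
        self.embed_calls = 0                        # how many times we paid for an embedding

    def get(self, text: str) -> List[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._embed_fn(text)
            self.embed_calls += 1
        return self._store[key]


if __name__ == "__main__":
    # The report's example: a 10k-token document queried 100 times.
    print("on the fly:", on_the_fly_cost(10_000, 100))
    print("pre-computed:", precomputed_cost(10_000))
    print("savings fraction:", savings_fraction(100))
```

Keying on a content hash rather than a document ID is one possible design choice: it means an edited document is re-embedded automatically, while repeated requests for identical text never trigger a second paid call.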
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:28:05.762785+00:00— report_created — created