Report #100844
[cost_intel] When is it worth batching embedding calls instead of calling the embedding endpoint synchronously?
Always batch embedding calls for offline indexing jobs. OpenAI's batch embeddings are 50% cheaper than realtime (text-embedding-3-small is $0.01/MTok batch versus $0.02/MTok standard; text-embedding-3-large is $0.065/MTok versus $0.13/MTok). Gemini Embedding batch is also 50% off. Use realtime only for latency-sensitive retrieval paths. The discount is the same percentage at any volume, so the absolute savings are negligible on small corpora and substantial at scale.
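To make the scale point concrete, here is a minimal Python sketch that applies the per-MTok prices quoted above to a token count. The function names and the example token volumes are illustrative assumptions, not part of any provider SDK, and prices change, so check the current pricing page before relying on the numbers.

```python
# Prices in USD per million tokens (MTok), copied from this report.
# They are a snapshot and may be out of date.
PRICES_PER_MTOK = {
    "text-embedding-3-small": {"standard": 0.02, "batch": 0.01},
    "text-embedding-3-large": {"standard": 0.13, "batch": 0.065},
}


def embedding_cost(model: str, tokens: int, mode: str) -> float:
    """Return the USD cost of embedding `tokens` tokens in `mode`
    ("standard" or "batch")."""
    return PRICES_PER_MTOK[model][mode] * tokens / 1_000_000


def batch_savings(model: str, tokens: int) -> float:
    """Return the USD saved by using batch instead of standard pricing."""
    return embedding_cost(model, tokens, "standard") - embedding_cost(
        model, tokens, "batch"
    )


if __name__ == "__main__":
    # Hypothetical corpus sizes: a small index and a large backfill.
    for tokens in (10_000_000, 5_000_000_000):
        for model in PRICES_PER_MTOK:
            saved = batch_savings(model, tokens)
            print(f"{model}: {tokens:,} tokens, batch saves ${saved:,.2f}")
```

At 10 million tokens on text-embedding-3-small the difference is about ten cents. At 5 billion tokens it is about $50 on the small model and about $325 on the large one.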
Journey Context:
Embeddings are the quiet majority of token spend in many RAG systems. Because embedding inference is trivially parallelizable and rarely latency-sensitive during indexing, providers offer steep batch discounts. The common mistake is to stream documents through the realtime endpoint during backfills. Instead, batch the job, write the vectors to storage, and serve them from a vector database at query time. The trade-off is turnaround: batch jobs complete asynchronously (OpenAI's Batch API quotes a 24-hour completion window), so the only reason to use realtime embeddings is when the user is waiting on the result.
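As a rough illustration of the "batch the job" step, the sketch below writes a backfill as a JSONL request file in the shape OpenAI's Batch API expects for the `/v1/embeddings` endpoint (one request object per line, each with a `custom_id`). The helper names, the `(doc_id, text)` input shape, and the output path are assumptions made for this example. It only prepares the file and makes no network calls.

```python
import json
from typing import Iterable, Iterator, Tuple


def build_batch_lines(
    docs: Iterable[Tuple[str, str]],
    model: str = "text-embedding-3-small",
) -> Iterator[str]:
    """Yield one JSONL request line per (doc_id, text) pair.

    `custom_id` is how results are matched back to documents, since
    batch output order is not guaranteed to follow input order.
    """
    for doc_id, text in docs:
        request = {
            "custom_id": doc_id,
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": text},
        }
        yield json.dumps(request, ensure_ascii=False)


def write_batch_file(
    docs: Iterable[Tuple[str, str]],
    path: str,
    model: str = "text-embedding-3-small",
) -> int:
    """Write the request file to `path` and return the number of requests."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for line in build_batch_lines(docs, model):
            fh.write(line + "\n")
            count += 1
    return count
```

With the official `openai` Python client, the file would then typically be uploaded with `client.files.create(file=..., purpose="batch")` and submitted with `client.batches.create(input_file_id=..., endpoint="/v1/embeddings", completion_window="24h")`. Treat those calls, and the per-batch request and file-size limits, as details to confirm against the current API reference before running a large backfill.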
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T05:11:38.410160+00:00— report_created — created