Report #61835
[cost\_intel] Real-time embedding API calls creating 50% cost overhead on asynchronous bulk indexing jobs
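Use OpenAI's Batch API for embedding jobs of more than 1,000 texts that can tolerate up to 24 hours of latency. Batch pricing is 50% of the real-time rate \(listed as $0.01 vs $0.02 per 1M tokens for text-embedding-3-small at the time of writing; confirm against the current pricing page\). Chunk submissions into JSONL files of at most 50MB, which stays well under the documented 200 MB per-file limit, and submit before 6 PM PST for overnight processing.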
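A minimal sketch of the chunking step, assuming the Batch API request-line format for the `/v1/embeddings` endpoint. The helper names `batch_line` and `chunk_lines` are hypothetical, and the 50,000-request cap is an assumption to verify against the current Batch API documentation.

```python
import json
from typing import Iterable, Iterator

MAX_BYTES = 50 * 1024 * 1024  # 50MB chunk size recommended in this report
MAX_REQUESTS = 50_000         # assumed per-batch request cap; verify before relying on it


def batch_line(custom_id: str, text: str, model: str = "text-embedding-3-small") -> str:
    """Build one JSONL line describing a single embedding request."""
    return json.dumps({
        "custom_id": custom_id,  # caller-chosen ID used to match results to inputs
        "method": "POST",
        "url": "/v1/embeddings",
        "body": {"model": model, "input": text},
    })


def chunk_lines(lines: Iterable[str], max_bytes: int = MAX_BYTES,
                max_requests: int = MAX_REQUESTS) -> Iterator[list[str]]:
    """Group lines into chunks that stay under the byte and request limits."""
    chunk, size = [], 0
    for line in lines:
        n = len(line.encode("utf-8")) + 1  # +1 for the trailing newline
        # Start a new chunk when adding this line would break either limit.
        if chunk and (size + n > max_bytes or len(chunk) >= max_requests):
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += n
    if chunk:
        yield chunk
```

Each yielded chunk can be written to its own file with `"\n".join(chunk) + "\n"` and submitted as a separate batch job.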
Journey Context:
RAG indexing pipelines often treat embedding as a real-time blocking operation, calling the standard embedding endpoint for each document. For bulk backfilling or periodic re-indexing, this wastes money. OpenAI's Batch API offers a 50% discount for asynchronous processing with a 24-hour completion window. Critical distinction: this is not the same as 'batching' \(sending multiple texts in one HTTP request to the standard endpoint\). The Batch API requires uploading a JSONL file through the Files endpoint, creating a batch job that returns a job ID, and polling that job until it completes. Mistake to avoid: using the Batch API for latency-sensitive operations \(jobs can take hours\). Threshold: it is only worthwhile for more than 1,000 texts because of the file management overhead. For high-volume real-time streaming ingestion, send multiple texts per request to the standard endpoint \(OpenAI documents up to 2,048 inputs per request\) and do not use the Batch API. For nightly re-indexing of 100k documents, the Batch API halves the embedding bill whatever its size, so a pipeline spending $500/day at real-time rates would spend $250/day.
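The routing rule and the upload, create, poll workflow described above could be sketched as follows. This is an illustration, not a verified implementation: `choose_route` and `submit_and_wait` are hypothetical helper names, the code assumes the official `openai` Python package with `OPENAI_API_KEY` set, and it expects a JSONL file already in the Batch API request format.

```python
import time

BATCH_MIN_TEXTS = 1000  # threshold from this report


def choose_route(num_texts: int, tolerates_24h: bool) -> str:
    """Pick the Batch API only for large jobs that can wait up to 24 hours."""
    if tolerates_24h and num_texts > BATCH_MIN_TEXTS:
        return "batch_api"
    return "standard_endpoint"


def submit_and_wait(jsonl_path: str, poll_seconds: int = 60) -> str:
    """Upload a JSONL file, create a batch job, poll it, and return the raw output."""
    from openai import OpenAI  # lazy import: only needed when actually submitting

    client = OpenAI()
    with open(jsonl_path, "rb") as f:
        uploaded = client.files.create(file=f, purpose="batch")
    job = client.batches.create(
        input_file_id=uploaded.id,
        endpoint="/v1/embeddings",
        completion_window="24h",
    )
    # Poll until the job reaches a terminal status; this can take hours.
    while job.status not in ("completed", "failed", "expired", "cancelled"):
        time.sleep(poll_seconds)
        job = client.batches.retrieve(job.id)
    if job.status != "completed":
        raise RuntimeError(f"batch {job.id} ended with status {job.status}")
    # The output is itself JSONL: one result per line, keyed by custom_id.
    return client.files.content(job.output_file_id).text
```

For example, `choose_route(100_000, True)` selects the Batch API for a nightly re-index, while the same volume on a latency-sensitive path stays on the standard endpoint.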
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:16:47.198343+00:00— report_created — created