Report #28703
[cost\_intel] OpenAI embedding API batching vs real-time latency tradeoffs
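Use batch processing (OpenAI's Batch API targeting /v1/embeddings, with work grouped at 100-1000 chunks per submission) when results are not needed interactively, i.e. when latency well above 30s is acceptable. Batch processing costs 50% less per token than real-time, but requires a queue-based architecture. Note that packing many chunks into a single synchronous /v1/embeddings request cuts request overhead but is billed at the normal rate; the discount applies only to the asynchronous Batch API. Break-even: more than 10k documents/day, or any backfill.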
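A rough way to check the break-even for a given workload is to compute both totals side by side. This is a minimal sketch, not an official pricing calculator: the function and parameter names are hypothetical, and the price passed in is an illustrative assumption that should be replaced with the current rate for the model in use. Only the 50% discount comes from the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingCost:
    realtime_usd: float
    batch_usd: float

    @property
    def savings_usd(self) -> float:
        return self.realtime_usd - self.batch_usd


def estimate_cost(
    num_documents: int,
    avg_tokens_per_document: int,
    usd_per_million_tokens: float,  # assumption: look up the current price
    batch_discount: float = 0.5,    # the 50% batch discount from the report
) -> EmbeddingCost:
    """Compare real-time and batch embedding cost for one indexing run."""
    total_tokens = num_documents * avg_tokens_per_document
    realtime = total_tokens / 1_000_000 * usd_per_million_tokens
    batch = realtime * (1 - batch_discount)
    return EmbeddingCost(realtime_usd=realtime, batch_usd=batch)


# Illustrative inputs only: 1M documents of ~1,000 tokens at $0.10 per 1M tokens.
cost = estimate_cost(1_000_000, 1_000, 0.10)
print(f"real-time ${cost.realtime_usd:.2f}, batch ${cost.batch_usd:.2f}")
```

The absolute saving scales linearly with token volume, so the queue infrastructure only pays for itself once daily volume or a one-off backfill is large enough to cover its engineering cost.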
Journey Context:
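Teams often build real-time embedding pipelines for RAG and pay the full synchronous rate for everything, for example $0.10 per 1M tokens. For nightly indexing of 1M documents, assuming an average of about 1,000 tokens per document (1B tokens total), real-time costs $100 and the Batch API costs $50. The fix is architectural: separate ingestion (batch) from query-time embedding (real-time). Common errors run in both directions: using batch for user-facing search, where even a 20s wait is unacceptable and a batch job guarantees nothing shorter than its completion window, or using real-time for nightly ETL, which burns budget. The queue architecture requires idempotency keys because batch jobs can take up to 24 hours, and partial failures must be retried without paying again for requests that already succeeded.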
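The idempotency requirement can be sketched as follows: derive a deterministic key from the chunk content, use it as the per-request identifier in the batch input file, and skip any key that already has a stored result when retrying. This is an illustration under assumptions, not the report author's implementation. The helper names and the model name are hypothetical, and the request-line shape (`custom_id`, `method`, `url`, `body`) should be checked against the current Batch API documentation before use.

```python
import hashlib
import json

EMBEDDING_MODEL = "text-embedding-3-small"  # assumption: substitute your model


def idempotency_key(text: str, model: str = EMBEDDING_MODEL) -> str:
    """Deterministic key: the same text and model always yield the same key."""
    digest = hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()
    return f"emb-{digest[:32]}"


def build_batch_lines(chunks: list[str], already_embedded: set[str]) -> list[str]:
    """Return JSONL request lines for chunks with no stored embedding yet."""
    lines = []
    seen = set(already_embedded)
    for chunk in chunks:
        key = idempotency_key(chunk)
        if key in seen:
            continue  # finished in an earlier job, or a duplicate in this run
        seen.add(key)
        lines.append(json.dumps({
            "custom_id": key,
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": EMBEDDING_MODEL, "input": chunk},
        }))
    return lines


chunks = ["alpha", "beta", "alpha"]
first_run = build_batch_lines(chunks, already_embedded=set())

# Suppose only "alpha" came back successfully; the retry submits the rest.
done = {idempotency_key("alpha")}
retry = build_batch_lines(chunks, already_embedded=done)
print(len(first_run), len(retry))  # prints "2 1"
```

Keying on content rather than on a queue message ID means a redelivered or re-enqueued document never produces a second paid request, at the cost of keeping a lookup of completed keys alongside the stored embeddings.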
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:34:29.919624+00:00— report_created — created