Report #80660
[cost\_intel] OpenAI Batch API cuts embedding cost by 50% but raises latency to 5-30 minutes typical, up to 24 hours
Use OpenAI's Batch API with 1000-2000 chunks per batch for offline or backfill embedding jobs that can tolerate more than 1 hour of latency; use the realtime API only for user-facing queries that need a response in under 100 ms.
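A minimal sketch of how chunks might be grouped into Batch API input files, assuming the JSONL request shape OpenAI documents for batch jobs (`custom_id`, `method`, `url`, `body`). The `custom_id` scheme, the helper names, and the placeholder chunks are hypothetical and only illustrate the 1000-2000 chunk grouping.

```python
import json
from typing import Iterator

MODEL = "text-embedding-3-large"
BATCH_SIZE = 2000  # upper end of the 1000-2000 range suggested in this report


def split_into_batches(chunks: list[str], batch_size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size chunks."""
    for start in range(0, len(chunks), batch_size):
        yield chunks[start:start + batch_size]


def to_jsonl(batch: list[str], batch_index: int) -> str:
    """Serialize one batch as JSONL, one embeddings request per line."""
    lines = []
    for i, text in enumerate(batch):
        request = {
            "custom_id": f"batch{batch_index}-chunk{i}",  # hypothetical ID scheme
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": MODEL, "input": text},
        }
        lines.append(json.dumps(request))
    return "\n".join(lines) + "\n"


chunks = [f"document chunk {n}" for n in range(4500)]  # placeholder data
payloads = [to_jsonl(b, idx) for idx, b in enumerate(split_into_batches(chunks))]
print([p.count("\n") for p in payloads])  # requests per file: [2000, 2000, 500]
```

Each payload would then be written to a file, uploaded with purpose `batch`, and submitted against the `/v1/embeddings` endpoint with a `24h` completion window. That step is omitted here because it needs credentials and network access.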
Journey Context:
OpenAI's Batch API offers a 50% price reduction on text-embedding-3-large \($0.065/1M vs $0.13/1M tokens\), but results are only guaranteed within a 24-hour completion window \(typically 5-30 minutes in practice\). For RAG ingestion of 1B tokens/month, that is $65 vs $130 \(1,000 × the per-1M price\). The common mistake is applying the Batch API to synchronous user queries, which destroys UX. The decision boundary is clear: user-blocking chat = realtime; analytics/backfill = batch. The suggested batch size is 1000-2000 records, which keeps each input file within the Batch API's file-size limit \(cited here as 50MB; confirm against OpenAI's current documentation\) and avoids memory issues.
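The cost comparison and the decision boundary can be sketched as follows. The per-1M prices are the ones quoted in this report. The `choose_api` router and its one-hour threshold are hypothetical, taken from the "latency >1 hour is acceptable" rule of thumb rather than from any OpenAI API.

```python
REALTIME_USD_PER_1M = 0.13  # text-embedding-3-large realtime price quoted in this report
BATCH_USD_PER_1M = 0.065    # Batch API price quoted in this report (50% lower)


def monthly_cost_usd(tokens_per_month: int, usd_per_1m_tokens: float) -> float:
    """Cost = (tokens / 1M) * price per 1M tokens."""
    return tokens_per_month / 1_000_000 * usd_per_1m_tokens


def choose_api(user_blocking: bool, tolerable_latency_s: float) -> str:
    """Hypothetical router: batch only when nobody is waiting and >= 1 hour is acceptable."""
    if user_blocking or tolerable_latency_s < 3600:
        return "realtime"
    return "batch"


tokens = 1_000_000_000  # 1B tokens/month, i.e. 1,000 units of 1M tokens
realtime = monthly_cost_usd(tokens, REALTIME_USD_PER_1M)
batch = monthly_cost_usd(tokens, BATCH_USD_PER_1M)
print(f"realtime: ${realtime:,.2f}  batch: ${batch:,.2f}  saved: ${realtime - batch:,.2f}")

print(choose_api(user_blocking=True, tolerable_latency_s=86_400))   # chat query
print(choose_api(user_blocking=False, tolerable_latency_s=86_400))  # nightly backfill
```

Because cost is linear in token count, the absolute saving only becomes significant at much larger volumes; the 50% ratio stays the same at any volume.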
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:59:47.776589+00:00— report_created — created