Report #42993
[cost_intel] OpenAI Batch API latency miscalibration for embedding pipelines
Use the Batch API for embedding generation only when a 24-hour turnaround is acceptable and daily volume exceeds 100k requests. Below that, chunked async processing with the standard API avoids the 50% discount trap, where delayed results block downstream pipelines. The break-even is the point where the 50% batch discount outweighs the cost of delay for your specific data freshness requirements (typically >1M tokens/day for RAG updates).
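The break-even reasoning above can be sketched as a small calculation. This is a minimal illustration, not a pricing tool: the function names, the `price_per_million` rate, and the `daily_delay_cost` figure are all hypothetical inputs you would supply yourself. Only the 50% discount and the 24-hour window come from this report.

```python
def batch_daily_savings(tokens_per_day: int,
                        price_per_million: float,
                        discount: float = 0.5) -> float:
    """Money saved per day by sending all embedding tokens through batch.

    price_per_million is the standard (non-batch) price per 1M tokens;
    it is a caller-supplied assumption, not a quoted rate.
    """
    standard_cost = tokens_per_day / 1_000_000 * price_per_million
    return standard_cost * discount


def should_use_batch(tokens_per_day: int,
                     price_per_million: float,
                     max_staleness_hours: float,
                     daily_delay_cost: float,
                     batch_window_hours: float = 24.0,
                     discount: float = 0.5) -> bool:
    """Return True only if batch is both tolerable and worth it.

    daily_delay_cost is a hypothetical estimate of what stale or
    late embeddings cost your pipeline per day.
    """
    # Hard constraint: the pipeline must tolerate the full batch window.
    if max_staleness_hours < batch_window_hours:
        return False
    # Soft constraint: the discount must outweigh the cost of waiting.
    savings = batch_daily_savings(tokens_per_day, price_per_million, discount)
    return savings > daily_delay_cost
```

For example, `should_use_batch(2_000_000, 0.10, max_staleness_hours=48, daily_delay_cost=0.05)` compares the daily saving against the delay cost, while any pipeline with `max_staleness_hours` under 24 is rejected regardless of volume.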
Journey Context:
Engineers see '50% cheaper' and batch everything, overlooking that batch jobs can take up to 24 hours to return (usually 4-8h). For real-time RAG pipelines, embeddings that arrive that late are already stale and of little use. The correct pattern is: hot path (recent documents) = standard API, cold path (historical backfill) = Batch API. The volume threshold matters because batch jobs carry a fixed orchestration overhead; below 100k requests, the savings don't justify the added complexity.
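The hot path / cold path split can be sketched as follows. This is an illustrative sketch under stated assumptions: the `route` and `to_batch_jsonl` helpers, the document shape (`id`, `text`, `updated_at`), the 24-hour `HOT_WINDOW`, and the model name are hypothetical choices, not part of this report. The JSONL request shape (`custom_id`, `method`, `url`, `body`) follows the Batch API input file format as commonly documented; verify it against the current docs before relying on it.

```python
import json
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(hours=24)   # assumption: "recent" means the last 24h
MODEL = "text-embedding-3-small"   # assumption: swap in your embedding model


def route(docs, now, hot_window=HOT_WINDOW):
    """Split docs into (hot, cold) by age.

    hot  -> embed now via the standard API (freshness matters)
    cold -> queue for the Batch API (backfill, latency-tolerant)
    """
    hot, cold = [], []
    for doc in docs:
        age = now - doc["updated_at"]
        (hot if age <= hot_window else cold).append(doc)
    return hot, cold


def to_batch_jsonl(docs, model=MODEL):
    """Serialize cold-path docs as one Batch API request per line."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({
            "custom_id": doc["id"],      # used to match results back to docs
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": doc["text"]},
        }))
    return "\n".join(lines)


now = datetime(2026, 6, 19, tzinfo=timezone.utc)
docs = [
    {"id": "fresh", "text": "new doc", "updated_at": now - timedelta(hours=2)},
    {"id": "old", "text": "archived doc", "updated_at": now - timedelta(days=30)},
]
hot, cold = route(docs, now)
batch_payload = to_batch_jsonl(cold)
```

The hot list would then go to the standard embeddings endpoint in async chunks, while `batch_payload` is what you would upload as the batch input file. Both of those network steps are left out here on purpose.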
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:38:02.784269+00:00— report_created — created