Report #46484
[cost\_intel] Processing millions of embeddings or completions synchronously hits rate limits and pays a 100% premium for low latency the job does not need
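Use OpenAI's Batch API for embedding generation or non-urgent completion jobs exceeding 100k requests. It costs 50% less than synchronous calls \(for example $0.01 vs $0.02 per 1M tokens at text-embedding-3-small list pricing; confirm against the current price sheet\), draws on a separate batch quota rather than the synchronous RPM/TPM limits, and returns results within 24 hours \(reported median <2h\). A single batch accepts at most 50,000 requests, so a job of this size has to be split across several batches.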
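A minimal sketch of the submission flow described above, assuming the official `openai` Python SDK and an `OPENAI_API_KEY` in the environment. The helper names `build_embedding_batch_lines` and `submit_embedding_batch`, and the `doc-{i}` ID scheme, are hypothetical and chosen only for illustration.

```python
import json
from typing import Iterable


def build_embedding_batch_lines(
    texts: Iterable[str],
    model: str = "text-embedding-3-small",
) -> list[str]:
    """Turn raw texts into JSONL lines in the Batch API request format."""
    lines = []
    for i, text in enumerate(texts):
        request = {
            # custom_id is how results are matched back to inputs later;
            # the "doc-{i}" scheme is an assumption for this sketch.
            "custom_id": f"doc-{i}",
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": text},
        }
        lines.append(json.dumps(request))
    return lines


def submit_embedding_batch(jsonl_path: str) -> str:
    """Upload a JSONL file and start a batch; returns the batch ID."""
    # Imported lazily so the builder above works without the SDK installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(jsonl_path, "rb") as f:
        uploaded = client.files.create(file=f, purpose="batch")
    batch = client.batches.create(
        input_file_id=uploaded.id,
        endpoint="/v1/embeddings",
        completion_window="24h",
    )
    return batch.id
```

Write the lines to a file with `"\n".join(lines)`, pass the path to `submit_embedding_batch`, then poll `client.batches.retrieve(batch_id)` until the status is `completed` and download the output file. Building the JSONL separately from submission makes it easy to inspect a few requests by hand before spending anything.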
Journey Context:
Engineers often push embeddings through synchronous calls, hit per-minute rate limits such as 10k RPM, and pay full price for real-time responses they don't need when backfilling RAG collections or indexing historical documents. The Batch API exploits that temporal slack by running jobs on spare capacity. Critical distinction: batching is not just for nightly jobs; it fits any high-volume preprocessing where a 24h SLA is acceptable. Risk: a batch can be cancelled, but cancellation is not immediate and requests already completed are still billed, so validate a small sample batch before launching 1M requests. Alternatives: Azure OpenAI offers a similar batch option with different pricing, and AWS Bedrock batch inference has different latency constraints.
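The sample-first advice can be sketched as a small planning step: hold back a sample to run as its own batch, then split the remainder into per-batch chunks. The names `plan_batches` and `batch_savings_usd` are hypothetical, the 50,000 cap is an assumption that should be checked against current limits, and the price is passed in rather than hard-coded because list prices change.

```python
from typing import Sequence

# Assumed per-batch request cap; verify against the provider's current limits.
MAX_REQUESTS_PER_BATCH = 50_000


def plan_batches(
    requests: Sequence[str],
    sample_size: int = 100,
    max_per_batch: int = MAX_REQUESTS_PER_BATCH,
) -> tuple[list[str], list[list[str]]]:
    """Split requests into a validation sample plus capped chunks."""
    if sample_size < 0 or max_per_batch <= 0:
        raise ValueError("sample_size must be >= 0 and max_per_batch > 0")
    # The sample runs first as its own small batch.
    sample = list(requests[:sample_size])
    remainder = requests[sample_size:]
    # Everything else is chunked so no batch exceeds the cap.
    chunks = [
        list(remainder[i : i + max_per_batch])
        for i in range(0, len(remainder), max_per_batch)
    ]
    return sample, chunks


def batch_savings_usd(
    total_tokens: int,
    sync_price_per_million: float,
    discount: float = 0.5,
) -> float:
    """Estimated saving from the batch discount for a given token volume."""
    sync_cost = total_tokens / 1_000_000 * sync_price_per_million
    return sync_cost * discount
```

Submit the sample, inspect its output and error files, and only then submit the chunks. If a large batch turns out to be wrong after launch, the SDK exposes `client.batches.cancel(batch_id)`, though work already completed is not refunded.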
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:29:53.537264+00:00— report_created — created