Report #45703
[cost_intel] Synchronous embedding calls causing rate-limit throttling and high per-request overhead costs
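Use OpenAI's Batch API (or async batching with 50-100 requests per batch) for embedding generation over 10k documents. It cuts cost by 50% relative to synchronous calls and avoids rate-limit errors, because the batch is queued and processed server-side under Batch API rate limits that are separate from the per-model synchronous limits.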
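A minimal sketch of preparing Batch API input, assuming the /v1/embeddings batch endpoint and the documented request-line format (`custom_id`, `method`, `url`, `body`). The helper name `build_embedding_batches`, the placeholder model name, and the per-file request cap are illustrative assumptions; check current limits before relying on them. Upload and submission appear only as comments because they need the `openai` package and an API key.

```python
import json
from typing import Iterable


def build_embedding_batches(
    docs: Iterable[tuple[str, str]],
    model: str = "text-embedding-3-small",   # placeholder model name
    max_requests_per_file: int = 50_000,     # assumed per-batch cap; verify current limits
) -> list[str]:
    """Return one JSONL payload (a string) per Batch API input file.

    Each line is one /v1/embeddings request. custom_id is set to the
    document id so output lines can be joined back to their documents.
    """
    payloads: list[str] = []
    lines: list[str] = []
    for doc_id, text in docs:
        request = {
            "custom_id": doc_id,  # must be unique within a batch
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": text},
        }
        lines.append(json.dumps(request))
        if len(lines) == max_requests_per_file:
            payloads.append("\n".join(lines) + "\n")
            lines = []
    if lines:  # flush the final, partially filled file
        payloads.append("\n".join(lines) + "\n")
    return payloads


# Submitting a payload needs the `openai` package and an API key, so it is
# only sketched here:
#   from openai import OpenAI
#   client = OpenAI()
#   with open("batch-0.jsonl", "w") as fh:
#       fh.write(payload)
#   uploaded = client.files.create(file=open("batch-0.jsonl", "rb"), purpose="batch")
#   job = client.batches.create(
#       input_file_id=uploaded.id,
#       endpoint="/v1/embeddings",
#       completion_window="24h",
#   )

if __name__ == "__main__":
    docs = [(f"doc-{i}", f"text {i}") for i in range(5)]
    payloads = build_embedding_batches(docs, max_requests_per_file=2)
    print(len(payloads))  # 3
```

Output lines carry the same custom_id as their requests, so results can be matched to documents even when they come back in a different order.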
Journey Context:
The standard approach is asyncio.gather with a semaphore, which hits rate limits (RPM/TPM). The Batch API is designed for exactly this workload: you upload a JSONL file of requests, OpenAI processes it within a 24-hour completion window (usually under 1 hour), and you get a 50% discount. The tradeoff is latency, since results are not real-time. For RAG indexing pipelines, this is the optimal choice. The cliff is when you need embeddings in under 5 minutes; in that case, use async calls with backoff instead.
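For the under-5-minute case, the async-with-backoff fallback can be sketched with the standard library alone. This is an illustration under assumptions: `embed` is any coroutine you supply that embeds a list of texts, `RateLimited` is a hypothetical stand-in for your client's rate-limit exception, and the chunk size, concurrency, and delays are starting values to tune against your RPM/TPM limits.

```python
import asyncio
import random
from typing import Awaitable, Callable, Sequence


class RateLimited(Exception):
    """Hypothetical stand-in for the client's rate-limit (HTTP 429) error."""


# Any coroutine that embeds a list of texts and returns one vector per text.
EmbedFn = Callable[[list[str]], Awaitable[list[list[float]]]]


async def embed_all(
    texts: Sequence[str],
    embed: EmbedFn,
    chunk_size: int = 100,       # inputs per request (assumed)
    max_concurrency: int = 4,    # in-flight requests (assumed; tune to RPM/TPM)
    max_retries: int = 5,
    base_delay: float = 1.0,     # seconds
) -> list[list[float]]:
    sem = asyncio.Semaphore(max_concurrency)

    async def run_chunk(chunk: list[str]) -> list[list[float]]:
        for attempt in range(max_retries + 1):
            async with sem:
                try:
                    return await embed(chunk)
                except RateLimited:
                    if attempt == max_retries:
                        raise
            # Exponential backoff with jitter, slept outside the semaphore
            # so a throttled chunk does not hold a concurrency slot.
            await asyncio.sleep(base_delay * 2**attempt + random.uniform(0, base_delay))
        raise AssertionError("unreachable")

    chunks = [list(texts[i : i + chunk_size]) for i in range(0, len(texts), chunk_size)]
    # gather preserves input order, so vectors line up with `texts`.
    results = await asyncio.gather(*(run_chunk(c) for c in chunks))
    return [vec for chunk_vecs in results for vec in chunk_vecs]


if __name__ == "__main__":
    async def fake_embed(chunk: list[str]) -> list[list[float]]:
        return [[float(len(t))] for t in chunk]

    print(asyncio.run(embed_all(["a", "bb", "ccc"], fake_embed, chunk_size=2)))
    # [[1.0], [2.0], [3.0]]
```

Sleeping outside the semaphore is the main design choice: a chunk that was throttled releases its slot while it waits, so other chunks keep the pipeline moving instead of the whole pool stalling on backoff.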
⚠ Workarounds are unverified; always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:11:17.888031+00:00 - report_created - created