Report #57538
[cost_intel] Calling embedding models (text-embedding-3-large) synchronously for backfill jobs or large-scale indexing, hitting rate limits and paying full per-request price instead of using the Batch API
Use OpenAI's Batch API (or Google Cloud Vertex AI batch prediction) for embedding jobs: you get a 50% cost reduction and sidestep the synchronous rate limits, in exchange for up to 24 hours of latency.
Journey Context:
Synchronous (online) embedding calls are billed at full price and run into TPM/RPM limits quickly; indexing 10M documents that way is slow and expensive. Batch APIs are designed for this workload: 50% off and higher throughput, but results can take up to 24 hours. The usual mistake is assuming real-time embeddings are needed when the workload is really a nightly job. Pricing can also vary by deployment type: some providers (Azure, for example) offer 'standard' and 'global' deployments at different prices. Batch fits best when the source documents change infrequently; for static docs the risk of stale vectors is low.
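As a rough illustration of the batch route, here is a minimal sketch that turns (doc_id, text) pairs into the JSONL request lines the OpenAI Batch API expects for the /v1/embeddings endpoint, splits them into files, and submits one file. The helper names (`build_batch_lines`, `chunk`, `submit_batch`) are hypothetical, the 50,000-requests-per-file cap is an assumption to verify against current provider limits, and `submit_batch` assumes the `openai` Python package is installed and `OPENAI_API_KEY` is set.

```python
import json
from typing import Iterable, Iterator

MODEL = "text-embedding-3-large"
# Assumption: per-file request cap; check the provider's current batch limits.
MAX_REQUESTS_PER_FILE = 50_000


def build_batch_lines(docs: Iterable[tuple[str, str]], model: str = MODEL) -> Iterator[str]:
    """Yield one JSONL request line per (doc_id, text) pair."""
    for doc_id, text in docs:
        request = {
            "custom_id": doc_id,  # used to match results back to documents
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": text},
        }
        yield json.dumps(request, ensure_ascii=False)


def chunk(lines: Iterable[str], size: int = MAX_REQUESTS_PER_FILE) -> Iterator[list[str]]:
    """Group request lines into lists of at most `size` items, one list per batch file."""
    batch: list[str] = []
    for line in lines:
        batch.append(line)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def submit_batch(path: str):
    """Upload one JSONL file and create a batch job (not called in this sketch)."""
    from openai import OpenAI  # imported lazily so the helpers above run without it

    client = OpenAI()
    with open(path, "rb") as fh:
        uploaded = client.files.create(file=fh, purpose="batch")
    return client.batches.create(
        input_file_id=uploaded.id,
        endpoint="/v1/embeddings",
        completion_window="24h",
    )
```

A nightly job would write each chunk to its own `.jsonl` file, call `submit_batch` on it, and later poll the job (for example with `client.batches.retrieve`) before downloading the output file and joining results back to documents by `custom_id`.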
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:03:57.020129+00:00— report_created — created