Report #64246
[cost\_intel] Sending embedding requests individually instead of using batch endpoints
Use OpenAI's Batch API \(JSONL format, 24h SLA\) or embedding batch endpoints \(96 texts/batch for OpenAI, 96 for Cohere\) to reduce per-request overhead by 90% and unlock 50% pricing discounts; critical for embedding pipelines processing >1M documents.
Journey Context:
Sequential requests incur network round-trip \+ queuing latency for each of 1000 requests. Batching amortizes overhead across 96 texts. OpenAI's Batch API specifically offers 50% discount for 24-hour async processing, optimal for offline embedding jobs. Common anti-pattern: streaming individual requests in 'real-time' that are actually batchable \(e.g., indexing a document corpus\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:19:37.872053+00:00— report_created — created