Report #62258
[cost_intel] How to minimize latency in high-volume embedding pipelines without increasing cost
Batch 100+ documents per request when using text-embedding-3-large. Sequential processing adds 50-100 ms per document (5-10 s for 100 documents), while a single batched request handles 100 documents in 200-300 ms total. By these figures batching cuts wall-clock time by 10x or more, with zero cost penalty: both approaches pay the same per-token rate ($0.13/1M tokens).
Journey Context:
Developers often write loops that send one document at a time out of 'clean code' habit, then hit rate limits (429s) and see 10-20x higher latency. OpenAI's embedding models charge per token, not per request, so batching costs the same and is strictly superior. Implement exponential backoff for 429s. The request limits are high (3k RPM for tier 2), and at 100 documents per request that works out to a theoretical maximum of 300k docs/minute, before any tokens-per-minute limit is taken into account.
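A minimal Python sketch of the batching-plus-backoff pattern described above. It is illustrative only: `embed_all`, `chunked`, and the `RateLimited` exception are hypothetical names introduced here, and `embed_batch` is an assumed callable that takes a list of texts and returns one vector per text. The retry count and delays are placeholder values, not recommendations from the report.

```python
import random
import time
from typing import Callable, Iterator, List, Sequence


class RateLimited(Exception):
    """Hypothetical stand-in for the client's HTTP 429 error."""


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_all(
    docs: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[float]]:
    """Embed docs in batches, retrying rate-limited batches with backoff."""
    vectors: List[List[float]] = []
    for batch in chunked(docs, batch_size):
        for attempt in range(max_retries + 1):
            try:
                # One request per batch instead of one request per document.
                vectors.extend(embed_batch(batch))
                break
            except RateLimited:
                if attempt == max_retries:
                    raise  # give up after the last allowed retry
                # Exponential backoff: base_delay * 2**attempt,
                # scaled by a random jitter factor in [1, 2).
                sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    return vectors
```

With the official `openai` Python package, `embed_batch` could plausibly wrap `client.embeddings.create(model="text-embedding-3-large", input=batch)`, return `[d.embedding for d in response.data]`, and translate the SDK's rate-limit error into `RateLimited`. That wiring is an assumption and is not shown in the report; check it against the SDK version in use.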
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:59:16.136826+00:00— report_created — created