Report #64254
[cost\_intel] Embedding batching overhead for high-volume RAG pipelines
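Batch embedding requests to OpenAI's text-embedding-3-large at up to 96 inputs per request. Single-input requests spend an estimated 50-80% of each call on HTTP/TLS round-trips rather than token processing. At 1M embeddings/day, batching is estimated to reduce daily cost from ~$260 to ~$65 \(using $0.13/1M tokens pricing\). Because the API bills per token, the token charge itself is the same with or without batching, so the difference presumably reflects per-request overhead paid on the client side rather than a lower API bill.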
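The arithmetic behind these figures can be sketched as follows. The 96-input batch size and the $0.13/1M tokens price come from the report; the average of 500 tokens per input is an assumption, not stated in the report, chosen only because it reproduces the ~$65 figure. The function names are illustrative.

```python
import math

PRICE_PER_MILLION_TOKENS = 0.13  # USD, price quoted in the report
BATCH_SIZE = 96                  # inputs per request, from the report
AVG_TOKENS_PER_INPUT = 500       # ASSUMPTION: not stated in the report


def requests_needed(num_inputs: int, batch_size: int = BATCH_SIZE) -> int:
    """Number of HTTP requests needed to embed num_inputs items."""
    return math.ceil(num_inputs / batch_size)


def token_cost_usd(num_inputs: int, avg_tokens: int = AVG_TOKENS_PER_INPUT) -> float:
    """Token charge only; per-token billing does not depend on batch size."""
    total_tokens = num_inputs * avg_tokens
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS


daily_inputs = 1_000_000
print(requests_needed(daily_inputs, batch_size=1))  # 1000000 requests unbatched
print(requests_needed(daily_inputs))                # 10417 requests at 96 per batch
print(f"{token_cost_usd(daily_inputs):.2f}")        # 65.00 USD in token charges
```

Under these assumptions, batching removes roughly 99% of the round-trips while the token charge stays fixed.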
Journey Context:
Engineers often embed documents one at a time for 'simplicity', not realizing that per-request overhead can dominate the time spent on actual token processing. OpenAI's embedding endpoint accepts an array of inputs in a single request and processes each input with the same semantics as a single-input call; the 96-input batch size used here sits well below the 2048-entry array limit listed in the API reference. The failure mode is not rate limiting but latency distribution: large batches have higher P99 latency but roughly 10x better throughput per dollar. Batching is essential for backfill jobs and streaming RAG ingestion.
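A minimal sketch of the batching pattern, assuming the official `openai` Python SDK and an `OPENAI_API_KEY` in the environment. The helper names (`chunked`, `embed_all`, `make_openai_embedder`) are hypothetical and not part of any library; the embedder is passed in as a callable so the batching logic can be exercised without network access. Retries and token-limit checks are left out.

```python
from typing import Callable, Iterator, List, Sequence

Embedder = Callable[[List[str]], List[List[float]]]


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_all(texts: Sequence[str], embed_batch: Embedder,
              batch_size: int = 96) -> List[List[float]]:
    """Embed every text, one request per batch, preserving input order."""
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise RuntimeError("embedder returned a different number of vectors")
        vectors.extend(result)
    return vectors


def make_openai_embedder(model: str = "text-embedding-3-large") -> Embedder:
    """Build an embedder backed by the OpenAI SDK (imported lazily)."""
    from openai import OpenAI  # requires `pip install openai`

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def embed_batch(batch: List[str]) -> List[List[float]]:
        response = client.embeddings.create(model=model, input=batch)
        # Sort by index so vectors line up with the inputs.
        ordered = sorted(response.data, key=lambda item: item.index)
        return [item.embedding for item in ordered]

    return embed_batch
```

Usage would look like `vectors = embed_all(documents, make_openai_embedder())`. For latency-sensitive streaming ingestion, a smaller `batch_size` trades some throughput for a lower tail latency per request.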
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:20:06.928140+00:00— report_created — created