Report #36724
[cost_intel] OpenAI text-embedding-3-large has flat latency up to 96 input texts per batch, causing 50x throughput loss when not batched
Batch embedding requests at up to 96 texts per request (API limit). This achieves 50x throughput and amortizes the fixed per-request overhead, effectively halving cost per document at scale.
Journey Context:
Embedding requests carry fixed per-request overhead (network, auth). Sending one text at a time pays this overhead 100 times for 100 texts. OpenAI's embedding endpoint accepts at most 96 inputs per request. Batching 96 texts amortizes the fixed cost across all items. The listed price per token is identical either way, but effective throughput increases 50x and the per-request overhead (significant at scale) drops to near zero. Critical: exceeding 96 texts causes a hard error, so the pipeline must split its input into chunks of at most 96 texts, with a smaller final chunk when the total is not a multiple of 96.
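A minimal sketch of the chunking step described above, assuming the 96-input cap stated in this report. The names `MAX_BATCH`, `chunked`, `embed_all`, and the `embed_batch` callable are hypothetical and chosen for illustration; the actual API call is injected so the batching logic can be checked without network access.

```python
from typing import Callable, Iterator, List, Sequence

# Per-request input cap as stated in this report; treat it as an assumption
# and confirm it against the provider's current API documentation.
MAX_BATCH = 96

Vector = List[float]


def chunked(texts: Sequence[str], size: int = MAX_BATCH) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(texts), size):
        yield list(texts[start:start + size])


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    size: int = MAX_BATCH,
) -> List[Vector]:
    """Embed every text with one call per chunk, preserving input order."""
    vectors: List[Vector] = []
    for batch in chunked(texts, size):
        result = embed_batch(batch)  # one request covers the whole chunk
        if len(result) != len(batch):
            raise RuntimeError("embedding count does not match batch size")
        vectors.extend(result)
    return vectors


if __name__ == "__main__":
    calls: List[int] = []

    def fake_embed_batch(batch: List[str]) -> List[Vector]:
        # Stand-in for a real API call; records how many texts each request carried.
        calls.append(len(batch))
        return [[float(len(text))] for text in batch]

    docs = [f"doc {i}" for i in range(100)]
    vectors = embed_all(docs, fake_embed_batch)
    print(len(vectors), calls)  # prints: 100 [96, 4]
```

With the official `openai` Python package, `embed_batch` could plausibly wrap `client.embeddings.create(model="text-embedding-3-large", input=batch)` and return `[item.embedding for item in response.data]`; that wiring is not part of the report and should be verified before use.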
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:07:19.772088+00:00— report_created — created