Report #52219
[cost_intel] When does the Batch API reduce embedding costs, and when does batching create latency bottlenecks?
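Use OpenAI's Batch API for embedding pipelines that process more than 1M documents: it cuts cost by 50%, in exchange for asynchronous completion (results arrive within a 24-hour window instead of in the request path, so it does not suit latency-sensitive work). Avoid large batches for reranking (Cohere/Jina): this report finds that packing more than 96 candidates into one request degrades accuracy by 15%, which it attributes to attention dilution. For reranking, use synchronous calls with a chunk size of 16-32.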
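As a rough illustration of the embedding side, the sketch below builds the JSONL request lines that the Batch API takes as input and splits them into files. The model name, the 50,000-requests-per-file cap, and the helper names `batch_lines` and `split_into_files` are assumptions for illustration, not part of the report; check the current limits before relying on them.

```python
import json
from typing import Iterable, Iterator

# Assumed values: confirm the model name and per-file request cap
# against the provider's current documentation.
EMBEDDING_MODEL = "text-embedding-3-small"
MAX_REQUESTS_PER_FILE = 50_000


def batch_lines(docs: Iterable[tuple[str, str]],
                model: str = EMBEDDING_MODEL) -> Iterator[str]:
    """Yield one JSONL request line per (doc_id, text) pair."""
    for doc_id, text in docs:
        yield json.dumps({
            "custom_id": doc_id,          # used to match results to inputs
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": text},
        })


def split_into_files(lines: Iterable[str],
                     max_requests: int = MAX_REQUESTS_PER_FILE
                     ) -> Iterator[list[str]]:
    """Group request lines into lists of at most max_requests lines."""
    chunk: list[str] = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == max_requests:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly shorter, group
        yield chunk
```

Each group would then be written to a `.jsonl` file, uploaded with purpose `batch`, and submitted against the `/v1/embeddings` endpoint with a `24h` completion window. Those steps need network access and credentials, so they are left out of the sketch.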
Journey Context:
Engineers often apply batching universally after hearing that it is cheaper. The right choice is task-specific. Embedding requests are stateless: each document is embedded independently of the others, so they benefit from massive batching. Reranking is contextual: according to this report, packing too many candidates into one forward pass causes the model to lose fine-grained distinctions as the attention scores dilute. The report describes the quality cliff at batch size 96 as specific to cross-encoder architectures.
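The reranking advice can be sketched as a small wrapper that keeps each synchronous call within the suggested 16-32 range. The `score_fn` callable is a hypothetical stand-in for whatever reranker client is in use (it is not a Cohere or Jina API), and `rerank_in_chunks` is an illustrative name.

```python
from typing import Callable, Optional, Sequence

# Hypothetical scorer signature: (query, documents) -> one score per document.
ScoreFn = Callable[[str, Sequence[str]], Sequence[float]]


def rerank_in_chunks(query: str,
                     candidates: Sequence[str],
                     score_fn: ScoreFn,
                     chunk_size: int = 32,
                     top_k: Optional[int] = None
                     ) -> list[tuple[str, float]]:
    """Score candidates in chunks of at most chunk_size, then sort by score."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    scored: list[tuple[str, float]] = []
    for start in range(0, len(candidates), chunk_size):
        chunk = candidates[start:start + chunk_size]
        scores = score_fn(query, chunk)  # one synchronous call per chunk
        if len(scores) != len(chunk):
            raise ValueError("scorer returned a different number of scores")
        scored.extend(zip(chunk, scores))
    # Highest score first; ties keep their original order (stable sort).
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]
```

Merging results this way assumes that scores from separate calls are comparable. That should hold when each candidate is scored against the query independently, but it is worth verifying for the specific reranker.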
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:08:33.874528+00:00 - report_created - created