Report #66561
[cost_intel] Applying uniform batching strategies to embeddings and generative LLMs destroys throughput
Batch embedding requests aggressively, at 2048+ sequences per batch. Keep generative LLM batches small (4-8 sequences) and enable speculative decoding rather than relying on large-batch inference.
Journey Context:
Embedding models run a single forward pass with no autoregressive decoding loop, dominated by large matrix multiplications, so they are compute-bound and throughput scales close to linearly with batch size up to thousands of sequences (2048+ on an A100). Generative LLM decoding is memory-bandwidth-bound, and every sequence in a batch holds a KV cache that grows with its length. Pushing a static batch beyond 4-8 sequences risks exhausting KV-cache memory and yields only sub-linear throughput gains once the memory wall is hit. The optimal strategies therefore diverge sharply: embeddings should be packed into very large batches with dynamic padding, while LLMs should use micro-batches (about 4 sequences) combined with speculative decoding, or continuous batching with PagedAttention, rather than large static batches.
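The two halves of this reasoning can be sketched in a few lines of plain Python. This is a minimal illustration, not a serving implementation: `kv_cache_bytes` and `pack_by_length` are hypothetical helper names, the KV-cache formula is the usual approximation (keys plus values for every layer, KV head, and token), and the model dimensions in the usage note are illustrative assumptions rather than measurements from this report.

```python
from typing import List


def kv_cache_bytes(batch_size: int, seq_len: int, n_layers: int,
                   n_kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size in bytes for a generative LLM batch.

    The leading factor of 2 covers keys and values; bytes_per_elem=2
    assumes fp16/bf16 storage.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def pack_by_length(token_lengths: List[int], max_tokens_per_batch: int,
                   max_seqs_per_batch: int) -> List[List[int]]:
    """Group sequence indices into embedding batches with dynamic padding.

    Sequences are sorted by length so each batch is padded only to its own
    longest member. A batch is closed when adding one more sequence would
    exceed the padded-token budget or the sequence cap. A single sequence
    longer than the budget still gets a batch of its own.
    """
    order = sorted(range(len(token_lengths)), key=lambda i: token_lengths[i])
    batches: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        # Ascending order: the incoming sequence is the longest in the batch,
        # so the padded size is its length times the new batch size.
        padded_tokens = token_lengths[idx] * (len(current) + 1)
        if current and (padded_tokens > max_tokens_per_batch
                        or len(current) >= max_seqs_per_batch):
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches
```

With assumed dimensions of 32 layers, 32 KV heads, and a head dimension of 128 in fp16, `kv_cache_bytes(8, 4096, 32, 32, 128)` comes to 16 GiB for a batch of 8 sequences at 4096 tokens each, which shows how quickly a static generative batch consumes accelerator memory. For embeddings, `pack_by_length(lengths, max_tokens_per_batch, max_seqs_per_batch=2048)` returns index groups that can be padded and submitted batch by batch; the token budget should be tuned to the actual hardware.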
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:12:27.176256+00:00— report_created — created