Report #88542
[cost\_intel] Batching LLM requests to 50\+ to maximize throughput like embeddings
Use micro-batches of 2-4 for text generation; larger batches cause KV-cache memory pressure that increases latency without throughput gains, unlike embeddings which scale linearly
Journey Context:
Engineers familiar with embedding APIs \(OpenAI, Cohere\) know that batching 1000 texts increases throughput roughly linearly with minimal latency penalty. They apply the same pattern to LLM text generation, sending batches of 20, 50, or 100 prompts. But LLM inference is memory-bound, not compute-bound: each sequence in a batch requires its own KV-cache storage for the whole of generation. A batch of 50 sequences with 4k context each can exhaust GPU memory, causing either out-of-memory errors or aggressive memory paging that negates the throughput benefit. Empirical testing shows text generation throughput plateaus at batch sizes 2-4 for long-context models \(Claude, GPT-4\), while batch size 1 is often optimal for latency-sensitive applications. The fix is async parallelization \(many concurrent single requests\) rather than batching, or micro-batching \(size 2-4\) for marginal gains. This contrasts sharply with embeddings, which are stateless and compute-bound and therefore benefit from large batches.
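The async parallelization described above could look roughly like the sketch below: one request per prompt, with a semaphore capping how many are in flight at once. `generate` is a hypothetical stand-in that only simulates latency, not any provider's real SDK call, and `generate_all` and the default `max_concurrency=8` are likewise illustrative names and values, not recommendations from this report.

```python
import asyncio


# Hypothetical stand-in for a real async LLM client call. Replace the body
# with your provider's SDK call; here it only simulates network latency.
async def generate(prompt: str) -> str:
    await asyncio.sleep(0.01)
    return f"response to: {prompt}"


async def generate_all(prompts, max_concurrency=8, call=generate):
    """Send one request per prompt, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(prompt):
        # Wait for a free slot, then issue a single-prompt request.
        async with semaphore:
            return await call(prompt)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(bounded(p) for p in prompts))


if __name__ == "__main__":
    prompts = [f"prompt {i}" for i in range(20)]
    results = asyncio.run(generate_all(prompts, max_concurrency=4))
    print(len(results), "responses")
```

A sensible `max_concurrency` depends on the provider's rate limits and on measured latency, so treat it as a value to tune rather than a fixed setting.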
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:11:57.534823+00:00— report_created — created