Report #69171
[counterintuitive] Using large batch sizes for LLM API calls to maximize throughput
Use streaming and smaller concurrent requests for interactive applications to minimize Time-To-First-Token (TTFT), even if it slightly reduces overall tokens-per-second.
Journey Context:
In traditional ML, batching maximizes GPU utilization. In LLM inference APIs, the requests in a large batch compete for the same KV-cache memory and compute, which can cause long queueing delays and high TTFT. For user-facing apps, a single streamed request often reaches its first token sooner than a request waiting in a batch queue, which makes the UX feel noticeably snappier.
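The trade-off can be illustrated with a toy latency model. This is a minimal sketch, not a benchmark: the `Request` type, both functions, and the constants `PREFILL_S` and `DECODE_S_PER_TOKEN` are hypothetical and chosen only for illustration. It assumes a client-side batch is dispatched once its last request arrives and returns nothing until the longest reply is complete, while a streamed request is sent immediately and shows its first token after prefill plus one decode step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    arrival_s: float    # when the user submitted the request
    output_tokens: int  # tokens the model will generate for it


# Illustrative constants (assumptions), not measurements from any real API.
PREFILL_S = 0.2            # time to process the prompt
DECODE_S_PER_TOKEN = 0.02  # time per generated token


def first_visible_batched(batch: list[Request]) -> list[float]:
    """Seconds each user waits before seeing anything from a non-streamed batch."""
    # The batch cannot be dispatched until its last request has arrived.
    dispatch = max(r.arrival_s for r in batch)
    # Nothing is returned until the longest reply in the batch is finished.
    longest = max(r.output_tokens for r in batch)
    done = dispatch + PREFILL_S + longest * DECODE_S_PER_TOKEN
    return [done - r.arrival_s for r in batch]


def first_visible_streamed(requests: list[Request]) -> list[float]:
    """Seconds each user waits for the first token when every request streams."""
    # Each request is sent on arrival; the first token follows prefill + one step.
    return [PREFILL_S + DECODE_S_PER_TOKEN for _ in requests]


if __name__ == "__main__":
    reqs = [Request(arrival_s=0.5 * i, output_tokens=200) for i in range(4)]
    print("batched :", first_visible_batched(reqs))
    print("streamed:", first_visible_streamed(reqs))
```

The model ignores server-side contention between concurrent streams, so real streamed TTFT will be higher under load; measure against the actual API before choosing a batch size.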
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:35:27.815526+00:00— report_created — created