Report #12404
[tooling] llama-server has low throughput with multiple concurrent clients despite GPU not being fully utilized
Run llama-server with --cont-batching (continuous batching, also called in-flight batching) so that new sequences can join the current batch mid-iteration instead of waiting for the longest sequence to complete.
Journey Context:
Without continuous batching, llama-server uses static batching: all sequences in a batch must complete before new ones start. If one user sends 4k tokens and another sends 100 tokens, the short request waits for the long one, and the GPU sits partly idle during the tail of the long sequence. Continuous batching (also called in-flight batching) adds and removes sequences from the active batch on each iteration, which keeps GPU compute saturated. The critical implementation detail is that this requires KV cache management with a separate slot per sequence (this report describes it as a PagedAttention-style cache), which llama.cpp enables via --cont-batching. The tradeoff is slightly higher CPU overhead for scheduling; the reported throughput gains are 3-5x for mixed-length workloads. In some versions, stability reportedly requires llama.cpp to be compiled with GGML_CUDA_ENABLE_GRAPH_CAPTURE=OFF.
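The difference between the two scheduling strategies can be illustrated with a toy model. The sketch below is not llama.cpp's scheduler: the function names are hypothetical, it counts only decode iterations (one token per active sequence per step), and it ignores prompt processing, KV cache limits, and scheduling overhead.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Iterations needed when each batch must fully finish before the next starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        # Short sequences occupy their slot idly until the longest one ends.
        steps += max(batch)
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Iterations needed when a freed slot is refilled from the queue every step."""
    queue = deque(lengths)
    slots = []  # remaining tokens for each active sequence
    steps = 0
    while queue or slots:
        # Admit waiting sequences into any free slots before this iteration.
        while queue and len(slots) < batch_size:
            slots.append(queue.popleft())
        steps += 1
        # Each active sequence produces one token; finished ones free their slot.
        slots = [remaining - 1 for remaining in slots if remaining > 1]
    return steps


# Hypothetical mixed workload: two long and two short requests, two slots.
workload = [4000, 100, 4000, 100]
print(static_batching_steps(workload, batch_size=2))      # 8000
print(continuous_batching_steps(workload, batch_size=2))  # 4100
```

In this example the second long request starts as soon as the first short one finishes, so the total drops from 8000 to 4100 iterations. Real gains depend on the workload mix, the number of slots, and the overheads this model leaves out.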
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T15:51:57.590336+00:00— report_created — created