Report #38147
[tooling] Low throughput serving concurrent requests with llama.cpp server despite powerful GPU
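Start llama-server with more than one slot by passing --parallel N (short form -np N; N > 1, e.g., 4) and make sure continuous batching is on (--cont-batching, which recent builds enable by default). Note that --slots does not set the slot count; in llama-server it only toggles the slot-monitoring endpoint. Run llama-server --help to confirm the flags for your build. With several slots and continuous batching, concurrent requests are decoded together in the same batch, which reportedly increases throughput 3-5x over sequential processing (the actual gain depends on model, hardware, and request mix), without starting multiple server instances.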
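The launch command can be sketched as a small Python helper that builds the argument list. This is a minimal sketch under stated assumptions, not a definitive recipe: the helper name `build_server_command`, the model path, and the per-slot context size are hypothetical, and it assumes a build where the slot count is set with `--parallel` and the total context given by `-c` is shared across slots. Check `llama-server --help` before relying on any flag.

```python
import shlex


def build_server_command(model_path, parallel=4, ctx_per_slot=4096, port=8080):
    """Build an argv list for llama-server with several parallel slots.

    Assumption: the context passed via -c is split across slots, so the
    total is sized as parallel * ctx_per_slot.
    """
    if parallel < 1:
        raise ValueError("parallel must be >= 1")
    return [
        "llama-server",
        "-m", model_path,
        "--parallel", str(parallel),         # number of slots (short form: -np)
        "--cont-batching",                   # continuous batching (short form: -cb)
        "-c", str(parallel * ctx_per_slot),  # total context, shared by all slots
        "--port", str(port),
    ]


if __name__ == "__main__":
    # "models/model.gguf" is a placeholder path, not taken from the report.
    print(shlex.join(build_server_command("models/model.gguf")))
```

The resulting list can be passed to `subprocess.Popen` as-is. Sizing `-c` from the slot count keeps each slot's usable context constant when N changes, at the cost of more VRAM for the KV cache.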
Journey Context:
Each llama-server slot processes one completion at a time, so a server running with a single slot serializes concurrent requests. Users often work around this by running multiple server instances, which fragments VRAM (each instance loads its own copy of the model) and adds latency. The correct approach is continuous batching (--cont-batching), which lets the server add a new request to the running batch as soon as a slot frees up, rather than waiting for every sequence in the batch to finish, keeping the GPU saturated. Combined with --parallel N for parallel sequence tracking, this maximizes batch utilization. Users miss this setup because the slot count has to be raised explicitly with --parallel (not --slots), older builds did not enable continuous batching by default, and documentation focuses on single-request examples.
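One way to check whether requests are actually being batched is to send several prompts at once and compare the wall-clock time against sending them one by one. The sketch below uses only the Python standard library. It assumes the server listens on 127.0.0.1:8080 and exposes the `/completion` endpoint with a `content` field in the JSON response; the helper names `post_completion` and `run_concurrent` are illustrative, not part of llama.cpp.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def post_completion(prompt, base_url="http://127.0.0.1:8080", n_predict=64):
    """Send one prompt to the server's /completion endpoint and return the text."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["content"]


def run_concurrent(prompts, send=post_completion, max_workers=4):
    """Send prompts with up to max_workers in flight.

    Returns (results in input order, elapsed seconds).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(send, prompts))
    return results, time.perf_counter() - start
```

Calling `run_concurrent(prompts, max_workers=4)` and then `run_concurrent(prompts, max_workers=1)` against the same server gives a rough comparison. If the two timings are close, the server is probably still running with a single slot.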
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:30:11.159834+00:00— report_created — created