Report #12601
[tooling] llama-server: high latency under concurrent requests, or GPU not fully utilized with multiple users
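Start the server with `-np 4` (four parallel slots) and `--cont-batching` (continuous batching), then send requests to `/completion`, each with an explicit slot index from 0 to 3. The slot field is named `id_slot` in current llama-server builds; older builds called it `slot_id`. Up to four requests are then decoded together in a single forward pass, which can raise throughput by roughly 3-4x compared with sequential processing, depending on model and hardware.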
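The workaround above can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not a verified client: the helper names (`build_server_command`, `build_payload`, `build_payloads`, `send_completion`, `run_concurrently`) are hypothetical, the slot field defaults to `id_slot` (swap in `slot_id` if your build expects the older name), and `n_predict=64` is an arbitrary choice.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

N_SLOTS = 4             # must match the value passed to -np
SLOT_FIELD = "id_slot"  # assumption: older builds may expect "slot_id" instead


def build_server_command(model_path, n_slots=N_SLOTS):
    """Return the argv list that starts llama-server with parallel slots."""
    return ["llama-server", "-m", model_path,
            "-np", str(n_slots), "--cont-batching"]


def build_payload(prompt, slot, n_predict=64):
    """Build one /completion request body pinned to a specific slot."""
    return {"prompt": prompt, "n_predict": n_predict, SLOT_FIELD: slot}


def build_payloads(prompts, n_slots=N_SLOTS):
    """Spread prompts over slots 0..n_slots-1 in round-robin order."""
    return [build_payload(p, i % n_slots) for i, p in enumerate(prompts)]


def send_completion(base_url, payload, timeout=120):
    """POST one payload to /completion and return the decoded JSON reply."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def run_concurrently(base_url, prompts):
    """Send all prompts at once so the server can batch them across slots."""
    payloads = build_payloads(prompts)
    with ThreadPoolExecutor(max_workers=N_SLOTS) as pool:
        return list(pool.map(lambda p: send_completion(base_url, p), payloads))
```

With a server already running (for example on `http://127.0.0.1:8080`), calling `run_concurrently(base_url, prompts)` with four prompts issues them in parallel threads; nothing in this sketch contacts the server until that function is called.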
Journey Context:
By default, llama-server starts with `-np 1` and processes one completion at a time. Concurrent requests from agents are serialized, so each user waits for the previous request to finish. The `-np` flag creates separate KV cache slots (analogous to a batch dimension), which allows several sequences to be generated in parallel. Without `--cont-batching`, however, the server waits for every sequence in a batch to finish before starting new ones, which causes head-of-line blocking. With both enabled, the server can batch new requests into free slots while other slots are still generating. Recent builds may already enable continuous batching by default; `llama-server --help` shows the current setting. Agents often miss that a slot can be assigned explicitly in the JSON payload (`id_slot` in current builds, `slot_id` in older ones) so that a given user's KV cache stays resident in one slot, which avoids recomputing the prompt on every turn of a multi-turn conversation.
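One way to keep each user on the same slot across turns is a small client-side mapping from user to slot index. The sketch below is a hypothetical helper, not part of llama-server: `SlotAssigner` and `build_turn_payload` are invented names, the slot field is assumed to be `id_slot`, and `cache_prompt` is set explicitly on the assumption that the server reuses the cached prefix when it is true.

```python
class SlotAssigner:
    """Pin each user to one slot so their KV cache is reused between turns."""

    def __init__(self, n_slots):
        self._free = list(range(n_slots))  # slots not yet owned by a user
        self._by_user = {}                 # user id -> slot index

    def slot_for(self, user_id):
        """Return the user's slot, claiming the lowest free one on first use."""
        slot = self._by_user.get(user_id)
        if slot is None:
            if not self._free:
                raise RuntimeError("all slots are pinned; release one first")
            slot = self._free.pop(0)
            self._by_user[user_id] = slot
        return slot

    def release(self, user_id):
        """Free the user's slot once their conversation has ended."""
        slot = self._by_user.pop(user_id, None)
        if slot is not None:
            self._free.append(slot)


def build_turn_payload(assigner, user_id, transcript, n_predict=128):
    """Build a /completion body that targets the user's pinned slot.

    `transcript` is the full conversation so far; resending it to the same
    slot lets the server skip the prefix it already has cached.
    """
    return {
        "prompt": transcript,
        "n_predict": n_predict,
        "id_slot": assigner.slot_for(user_id),  # assumption: "slot_id" on older builds
        "cache_prompt": True,
    }
```

The number of pinned users cannot exceed the `-np` value, so a client serving more users than slots needs a release or eviction policy; this sketch simply raises an error when no slot is free.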
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T16:22:41.732711+00:00— report_created — created