Report #88480
[tooling] llama.cpp server poor throughput under concurrent client load
Start the server with `-np 4` (parallel slots) explicitly enabled alongside `--cont-batching` (continuous batching, usually on by default), and monitor slot utilization via `/metrics` to confirm that the batch is saturated.
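A minimal sketch of that launch command, assembled in Python so the slot and context arithmetic is explicit. The helper name `build_server_command`, the model path, the port, and the per-slot context size are illustrative placeholders. The sketch also assumes that the total context passed via `-c` is divided evenly among slots, which may not hold for every build, so check it against your version of llama-server.

```python
import shlex


def build_server_command(model_path, n_parallel=4, ctx_per_slot=4096, port=8080):
    """Return an argv list for llama-server with parallel slots enabled.

    Assumption: the total context given to -c is split evenly across
    slots, so it is scaled by n_parallel to keep ctx_per_slot per request.
    """
    if n_parallel < 1:
        raise ValueError("n_parallel must be at least 1")
    total_ctx = ctx_per_slot * n_parallel
    return [
        "llama-server",
        "-m", model_path,
        "-np", str(n_parallel),   # number of parallel slots
        "-c", str(total_ctx),     # total context, shared by all slots
        "--cont-batching",        # usually on by default; passed explicitly here
        "--metrics",              # expose the /metrics endpoint
        "--port", str(port),
    ]


if __name__ == "__main__":
    # "model.gguf" is a placeholder path, not a real file.
    print(shlex.join(build_server_command("model.gguf")))
```

If per-request context matters more than concurrency, lower `n_parallel` rather than letting each slot's share shrink.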
Journey Context:
With a single slot, which is the default in many builds, llama-server processes one completion at a time, so concurrent requests queue and latency grows with load. The `-np` flag (number of parallel sequences) lets the server batch multiple independent requests into a single forward pass, amortizing the cost of reading the model weights across all active sequences. Continuous batching (`--cont-batching`, on by default) lets a new request join the batch as soon as a slot frees up, rather than waiting for the entire batch to complete. Together, these improve GPU utilization. As a rough guide, without `-np`, 4 concurrent users see about 4x the single-request latency; with `-np 4`, they see about 1.2x. Actual figures depend on the model, hardware, and prompt lengths. The `/metrics` endpoint (enabled with the `--metrics` flag) reports request and slot activity, which can be used to verify the configuration.
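The check against `/metrics` can be sketched as follows. This is an illustration under assumptions, not the server's documented contract: the metric names `llamacpp:requests_processing` and `llamacpp:requests_deferred`, the sample payload, and the URL are assumed, so confirm them against the output of your own server.

```python
import urllib.request

# Assumed metric names; verify against your server's /metrics output.
PROCESSING = "llamacpp:requests_processing"
DEFERRED = "llamacpp:requests_deferred"

# Hypothetical sample payload in Prometheus text format.
SAMPLE = """\
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 3
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 2
"""


def fetch_metrics(url="http://localhost:8080/metrics"):
    """Download the raw metrics text (the URL is a placeholder)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Parse 'name value' lines into a dict, skipping comment lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, raw = line.rpartition(" ")
        if not name:
            continue
        try:
            values[name] = float(raw)
        except ValueError:
            continue
    return values


def slot_utilization(metrics, n_parallel):
    """Fraction of slots busy, capped at 1.0."""
    return min(metrics.get(PROCESSING, 0.0) / n_parallel, 1.0)


if __name__ == "__main__":
    m = parse_metrics(SAMPLE)
    print("utilization:", slot_utilization(m, n_parallel=4))
    print("deferred:", m.get(DEFERRED, 0.0))
```

A utilization that stays well below 1.0 under load suggests the clients are not issuing enough concurrent requests, while a persistently non-zero deferred count suggests more slots may help.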
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:05:52.729861+00:00— report_created — created