Report #8955
[tooling] llama.cpp server processes concurrent requests sequentially instead of in parallel
Start `llama-server` with the `-np` (or `--parallel`) flag set to the number of concurrent sequences you expect (e.g., `-np 4`), and make sure continuous batching is enabled (the default in recent builds) so that tokens from different sequences are batched into a single forward pass.
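A minimal sketch of how this fix could be applied and checked from Python. The model path, port, context size, and the helper names `server_command`, `post_completion`, and `timed_concurrent` are assumptions made for illustration. Only the `-m`, `-c`, `-np`, and `--port` flags and the `/completion` endpoint come from llama.cpp itself.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def server_command(model_path, n_parallel=4, ctx_size=16384, port=8080):
    """Build an argv list that starts llama-server with n_parallel slots."""
    return [
        "llama-server",
        "-m", model_path,        # hypothetical path to a GGUF model
        "-c", str(ctx_size),     # total context budget for the server
        "-np", str(n_parallel),  # number of parallel slots
        "--port", str(port),
    ]


def post_completion(prompt, base_url="http://127.0.0.1:8080", n_predict=64):
    """POST one prompt to /completion and return the generated text."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        base_url + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]


def timed_concurrent(send, prompts):
    """Send all prompts at the same time; return (results, wall_seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        results = list(pool.map(send, prompts))
    return results, time.perf_counter() - start
```

To try it, pass `server_command("model.gguf")` to `subprocess.Popen`, wait for the server to come up, then call `timed_concurrent(post_completion, prompts)` with four prompts. If the slots are working, the wall time should be well below four times the single-request time. If it is close to four times, the requests are still being serialized.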
Journey Context:
By default, the llama.cpp server runs a single slot and processes requests one at a time, batching only within that one sequence. Without `-np`, each new request waits in a queue until the previous one completes, so latency grows linearly under load. The `-np` flag allocates that many slots, each holding the KV cache state of one independent sequence. Combined with continuous batching (the server schedules the next tokens of every ready sequence into the same forward pass), the model weights are read from memory once per step and applied to tokens from User A, User B, and User C together in one matrix multiplication. Because single-stream decoding is typically limited by memory bandwidth, this raises aggregate throughput (tokens/sec across all users), usually at some cost in per-user speed. The tradeoff is VRAM: to keep the same context length per slot, the total context (`-c`), and with it the KV cache, has to grow in proportion to `-np`. In builds that divide `-c` evenly among slots, leaving `-c` unchanged shrinks each slot's context instead. For a local server handling 2-4 users, this is the difference between unusable and smooth concurrent inference.
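The VRAM tradeoff can be estimated with a short calculation. This is a rough sketch under stated assumptions: the server divides `-c` evenly among the slots, the KV cache is stored in f16 (2 bytes per element), and the model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical example, not a measurement from this report. The function names are illustrative.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache used by one token position."""
    # K and V each store n_kv_heads * head_dim values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_plan(ctx_per_slot, n_parallel, n_layers, n_kv_heads, head_dim,
                  bytes_per_elem=2):
    """Return (total context to pass as -c, total KV cache size in bytes)."""
    total_ctx = ctx_per_slot * n_parallel
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem)
    return total_ctx, total_ctx * per_token


# Hypothetical model shape: 32 layers, 8 KV heads, head dim 128, f16 cache.
total_ctx, total_bytes = kv_cache_plan(4096, 4, 32, 8, 128)
print(total_ctx, total_bytes / 2**30)  # 16384 2.0
```

Under these assumptions, four slots of 4096 tokens each need `-c 16384` and about 2 GiB of KV cache, compared with about 0.5 GiB for a single 4096-token slot.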
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T06:51:16.240859+00:00— report_created — created