Report #72281
[tooling] llama.cpp server throughput low with concurrent clients despite fast single-request generation
Start the server with `-np 4` (or higher) to enable parallel sequence processing, and confirm that the build supports continuous batching (`-cb`, now enabled by default). Cap generation length with `-n 512` (maximum tokens generated per request) so that one long request cannot hold a slot and starve the others, and poll the `/slots` endpoint to monitor slot utilization (some builds expose it only when started with `--slots`). Compiling with `LLAMA_SERVER_SSL=ON` is optional: it adds HTTPS support and is unrelated to throughput.
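As a rough illustration of the launch flags described above, the following Python sketch assembles a server command line. The binary name `llama-server`, the model path, and the `-c` context size are assumptions for illustration and are not taken from this report; adjust them to match the local build.

```python
import shlex


def build_server_cmd(model_path, parallel=4, n_predict=512, ctx_size=8192):
    """Assemble an argv list that starts the server with parallel slots."""
    return [
        "llama-server",          # assumed binary name; older builds may differ
        "-m", model_path,        # hypothetical model path
        "-c", str(ctx_size),     # assumed total context size
        "-np", str(parallel),    # number of parallel slots
        "-cb",                   # continuous batching (default on newer builds)
        "-n", str(n_predict),    # cap on tokens generated per request
    ]


cmd = build_server_cmd("models/model.gguf")
# Print a copy-pasteable shell command; pass `cmd` to subprocess.Popen to launch.
print(shlex.join(cmd))
```

Keeping the flags in one helper makes it easy to vary `-np` and `-n` together while measuring throughput.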
Journey Context:
By default, the llama.cpp server runs with a single slot, so concurrent requests are queued and processed one at a time. Users often work around this by launching multiple instances or adding an external load balancer, which fragments VRAM across separate copies of the model and prevents batching. The `-np` (or `--parallel`) flag instead creates multiple slots inside one server: sequences share the model weights, and each slot gets its own partition of the KV cache. This is distinct from speculative decoding. Critical: the number of slots is fixed at startup, and a client that requests 4096 tokens occupies its slot until generation completes. Capping tokens per request with `-n`, or generating in chunks, is therefore essential. Continuous batching (`-cb`) schedules work from all active slots into shared batches to maximize GPU SM utilization. With a single slot, adding VRAM does not add concurrency.
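To check whether the slots are actually being used, several requests can be sent at once from a single client. The sketch below is a minimal example, assuming the server listens on `http://127.0.0.1:8080` and accepts `POST /completion` with `prompt` and `n_predict` fields; the helper names (`build_payload`, `complete`, `complete_many`) are hypothetical and exist only for this illustration.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://127.0.0.1:8080"  # assumed host and port


def build_payload(prompt, n_predict=512):
    """Request body with a per-request cap on generated tokens."""
    return {"prompt": prompt, "n_predict": n_predict}


def complete(prompt, n_predict=512, base_url=BASE_URL):
    """Send one completion request and return the generated text."""
    data = json.dumps(build_payload(prompt, n_predict)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/completion",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]


def complete_many(prompts, workers=4, n_predict=512):
    """Issue requests concurrently; match `workers` to the server's -np value."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: complete(p, n_predict), prompts))
```

Calling `complete_many(["Hello"] * 4)` against a server started with `-np 4` should keep all four slots busy; if total wall time is close to four times the single-request time, the requests are probably still being serialized.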
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:54:39.428546+00:00— report_created — created