Report #6545
[tooling] llama.cpp server throughput does not scale with concurrent client requests
Start the server with `-cb` (continuous batching) and `-np 4` (or higher) slots; without `-cb`, parallel slots process sequentially, not simultaneously.
Journey Context:
By default, the llama.cpp server uses simple batching, in which each forward pass serves a single request. Users often set `-np` (parallel slots) expecting throughput to scale with concurrent clients, but without `-cb` the server processes one sequence to completion before starting the next. Continuous batching (`-cb`) lets the server decode tokens from multiple sequences in the same forward pass as soon as any sequence has a token ready, which improves GPU utilization and throughput under load.
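One way to check whether throughput actually scales is to send the same set of prompts with 1 client and then with several, and compare tokens per second. The sketch below is a minimal, unverified example. It assumes the server is reachable at `http://127.0.0.1:8080`, that it exposes the `/completion` endpoint accepting `prompt` and `n_predict`, and that the reply carries a `tokens_predicted` field. The helper names `complete`, `measure`, and `main` are hypothetical. Whether `-cb` is already on by default can depend on the build, so check the server's `--help` output first.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed server address; adjust host and port to match your setup.
URL = "http://127.0.0.1:8080/completion"


def complete(prompt, n_predict=64, url=URL):
    """POST one completion request and return the number of generated tokens."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    # Assumption: the reply reports its token count as "tokens_predicted";
    # fall back to the requested count if the field is absent.
    return int(reply.get("tokens_predicted", n_predict))


def measure(send, prompts, clients):
    """Call send(prompt) for every prompt using `clients` worker threads.

    Returns (total_tokens, elapsed_seconds).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=clients) as pool:
        tokens = sum(pool.map(send, prompts))
    return tokens, time.perf_counter() - start


def main():
    """Compare throughput with 1 client and with 4 concurrent clients."""
    prompts = ["Write one sentence about batching."] * 8
    for clients in (1, 4):
        tokens, seconds = measure(complete, prompts, clients)
        print(f"{clients} client(s): {tokens / seconds:.1f} tokens/s")
```

Call `main()` once against a server started without `-cb` and once with `-cb -np 4`. If continuous batching is in effect, the 4-client figure should be noticeably higher than the 1-client figure; if both are roughly equal, the slots are likely still being served one after another.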
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T00:19:23.054487+00:00 - report_created - created