Report #35420
[tooling] llama.cpp server hangs or OOMs with concurrent requests, or throughput not scaling with \`-np\`
Use \`--cont-batching\` \(or \`-cb\`\) alongside \`-np\` to enable continuous batching, where sequences with different numbers of pending tokens are decoded in the same batch and new requests can join while others are still generating. Without \`-cb\`, \`-np\` only reserves slots: a request that arrives while other slots are generating waits until they finish.
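A minimal sketch of a launcher that always pairs the two flags, so \`-cb\` cannot be dropped from a deployment config by accident. The helper name \`build\_server\_args\`, the binary name \`llama-server\`, and the model path are assumptions for illustration; only the \`-m\`, \`-c\`, \`-np\`, and \`-cb\` flags are taken from llama.cpp itself.

```python
import shlex


def build_server_args(model_path, n_parallel, ctx_size,
                      cont_batching=True, binary="llama-server"):
    """Build an argv list for the llama.cpp server (hypothetical helper).

    Assumption: -c sets the context size for the whole server; how it is
    divided among the -np slots depends on the build, so size it with the
    number of slots in mind.
    """
    args = [binary, "-m", model_path, "-c", str(ctx_size), "-np", str(n_parallel)]
    if cont_batching:
        # Continuous batching: let new requests join an in-flight batch.
        args.append("-cb")
    return args


# Hypothetical model path; replace with a real GGUF file.
args = build_server_args("models/model.gguf", n_parallel=4, ctx_size=8192)
print(shlex.join(args))
# prints: llama-server -m models/model.gguf -c 8192 -np 4 -cb
```

The list can be passed to \`subprocess.Popen\` as-is; printing it with \`shlex.join\` first makes it easy to confirm that both flags are present.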
Journey Context:
The llama.cpp server has two distinct mechanisms for handling multiple requests: \(1\) \`-np\` \(parallel sequences\) reserves N slots in the KV cache, one per concurrent sequence; \(2\) \`-cb\` \(continuous batching\) lets the server add work for a new request to a batch that already contains tokens from sequences that are still generating. The underlying \`llama\_decode\` takes one flat batch in which every token carries its own position and sequence ID \(the \`pos\` and \`seq\_id\` arrays of \`llama\_batch\`; \`n\_tokens\` is the total token count, not a per-sequence array\), so sequences contributing different numbers of tokens can share a batch. Without \`-cb\`, the server does not start processing a new prompt while other slots are mid-generation: if one client is decoding 1 token per step and another arrives with a 10-token prompt, the newcomer waits, which looks like a hang under load and keeps throughput from scaling with \`-np\`. With \`-cb\`, the batch can hold 1 token for sequence 0 and 10 tokens for sequence 1, processed in a single \`llama\_decode\` call. This is essential for high-throughput API servers handling chat completions with varying prompt lengths. The flag was added in PR \#3856 and is still often missed in deployment configs; recent builds enable it by default \(\`-nocb\` disables it\), so check \`--help\` for the version in use.
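The way sequences of different lengths share one batch can be sketched in plain Python. This is a simplified model, not the llama.cpp API: \`FlatBatch\` and \`build\_batch\` are hypothetical names, and the fields only mirror the per-token layout of \`llama\_batch\` \(token, position, sequence ID, and a flag for whether logits are wanted\).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FlatBatch:
    """Simplified stand-in for a llama.cpp-style batch: parallel per-token lists."""
    token: list[int] = field(default_factory=list)
    pos: list[int] = field(default_factory=list)
    seq_id: list[int] = field(default_factory=list)
    logits: list[bool] = field(default_factory=list)

    @property
    def n_tokens(self) -> int:
        # Total tokens across all sequences in the batch.
        return len(self.token)


def build_batch(pending: dict[int, list[int]], n_past: dict[int, int]) -> FlatBatch:
    """Flatten the pending tokens of every sequence into one batch.

    pending maps sequence ID -> new tokens to decode this step.
    n_past maps sequence ID -> tokens already in that sequence's KV cache.
    """
    batch = FlatBatch()
    for sid, tokens in pending.items():
        start = n_past.get(sid, 0)
        for i, tok in enumerate(tokens):
            batch.token.append(tok)
            batch.pos.append(start + i)       # position within its own sequence
            batch.seq_id.append(sid)          # which slot the token belongs to
            # Only the last token of each sequence needs output logits.
            batch.logits.append(i == len(tokens) - 1)
    return batch


# Sequence 0 is mid-generation (1 new token); sequence 1 just arrived
# with a 10-token prompt. The token IDs are made up for illustration.
pending = {0: [101], 1: list(range(200, 210))}
n_past = {0: 57, 1: 0}
batch = build_batch(pending, n_past)
print(batch.n_tokens)  # prints: 11
```

Because each token is tagged with its own sequence ID and position, the decoder does not need the sequences to be the same length; the 1-token and 10-token contributions sit side by side in a single 11-token batch.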
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:55:01.897991+00:00— report_created — created