Report #61657
[tooling] llama.cpp server crashes under concurrent load or shows terrible throughput with multiple clients
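Enable continuous batching and parallel slots: ./server -m model.gguf -ngl 99 -c 4096 --parallel 4 --cont-batching. Set --parallel (-np) to the number of concurrent requests you expect, and set -c (total context) to at least parallel * avg_seq_len, because the context is split across the slots; with -c 4096 and --parallel 4, each slot gets roughly 1024 tokens. This lets the server interleave requests instead of blocking on one at a time. Newer builds name the binary llama-server and may enable continuous batching by default, so check --help for your version.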
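The sizing rule above (-c = parallel * per-slot sequence length) can be sketched as a small helper that builds the launch command. This is an illustration only: total_context and server_command are hypothetical names, the even split of -c across slots is assumed from the rule in this report, and the flags are the ones quoted above.

```python
from typing import List


def total_context(parallel: int, per_slot_tokens: int) -> int:
    """Return the -c value that gives each slot `per_slot_tokens` tokens.

    Assumes the server divides the total context evenly across slots.
    """
    if parallel < 1 or per_slot_tokens < 1:
        raise ValueError("parallel and per_slot_tokens must be positive")
    return parallel * per_slot_tokens


def server_command(model: str, parallel: int, per_slot_tokens: int,
                   gpu_layers: int = 99) -> List[str]:
    """Build the argument list for the server command quoted in this report."""
    ctx = total_context(parallel, per_slot_tokens)
    return [
        "./server",
        "-m", model,
        "-ngl", str(gpu_layers),      # layers offloaded to the GPU
        "-c", str(ctx),               # total context shared by all slots
        "--parallel", str(parallel),  # number of concurrent slots
        "--cont-batching",            # continuous batching
    ]


print(" ".join(server_command("model.gguf", parallel=4, per_slot_tokens=2048)))
```

For 4 slots of 2048 tokens each, this prints a command with -c 8192. Average prompt plus generation length is the number to feed in as per_slot_tokens; if the resulting context does not fit in memory, lower either value.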
Journey Context:
Without parallel slots and batching, llama.cpp server handles requests one after another, so concurrent clients see queue stalls, and a badly sized context can cause OOM. The --cont-batching flag (continuous batching) lets the server decode tokens from multiple sequences in a single forward pass, which improves throughput for concurrent workloads (often 2-4x, according to this report). The -np (parallel) parameter sets the number of slots that share the KV cache: setting it too low causes queueing, while setting it too high either shrinks each slot's context or, if -c is raised to compensate, causes OOM. This is the difference between a toy local server and production-capable local inference.
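One way to check whether the slots are actually being used is to send several requests at once and compare the wall-clock time against a sequential run. The sketch below is a minimal load probe, not a benchmark: it assumes the server listens on 127.0.0.1:8080 and exposes the /completion endpoint with prompt and n_predict fields and a content field in the response; adjust these if your build differs. The names post_completion and run_concurrent are hypothetical.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed address and endpoint; change to match your server.
SERVER_URL = "http://127.0.0.1:8080/completion"


def post_completion(prompt: str, n_predict: int = 64) -> str:
    """Send one completion request and return the generated text."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    req = urllib.request.Request(
        SERVER_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["content"]


def run_concurrent(prompts, send=post_completion, workers=4):
    """Send all prompts using `workers` threads.

    Returns (results in prompt order, elapsed seconds).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(send, prompts))
    return results, time.perf_counter() - start
```

Calling run_concurrent(prompts, workers=1) and then run_concurrent(prompts, workers=4) against a server started with --parallel 4 gives a rough comparison; if the two timings are about the same, requests are probably still being served one at a time.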
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:58:54.589650+00:00— report_created — created