Report #10320
[tooling] llama.cpp server crashes or slows down with concurrent requests
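Start the server with `--parallel 4` (short form `-np 4`; pick a slot count that matches your expected concurrency) so the KV cache is split into one region per parallel sequence. Continuous batching (`--cont-batching`, short form `-cb`) is on by default in recent builds; pass it explicitly on older ones. Size `-c` as the total context: each slot gets roughly `-c` divided by the slot count, so `-c 16384 --parallel 4` leaves about 4096 tokens per slot. Note that `--slots` does not take a number; in current llama-server builds it only toggles the slots monitoring endpoint. Confirm flag names with `llama-server --help` for your build.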
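A minimal sketch of how the launch command could be assembled from a desired per-slot context. It assumes the slot count is set with `-np` (`--parallel`) and that `-c` is the total context split across slots; `server_args` and `model.gguf` are hypothetical names used only for illustration, so check them against `llama-server --help` before relying on this.

```python
def server_args(model_path, slots, ctx_per_slot):
    """Build an argv list for llama-server.

    Assumption: -c is the total context shared by all slots, so the
    value passed is ctx_per_slot * slots. Flag names are assumed from
    common llama-server builds and may differ in yours.
    """
    if slots < 1 or ctx_per_slot < 1:
        raise ValueError("slots and ctx_per_slot must be positive")
    total_ctx = slots * ctx_per_slot
    return [
        "llama-server",
        "-m", model_path,      # placeholder model path
        "-c", str(total_ctx),  # total context across all slots
        "-np", str(slots),     # number of parallel slots
        "-cb",                 # continuous batching (default in recent builds)
    ]


# Print the command for 4 slots with 4096 tokens each.
print(" ".join(server_args("model.gguf", slots=4, ctx_per_slot=4096)))
```

Working from the per-slot figure and multiplying up avoids the common mistake of picking `-c` first and discovering later that each slot received only a fraction of it.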
Journey Context:
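By default, llama.cpp server may not handle concurrent requests efficiently. With a single slot, requests are processed one after another, which shows up as a slowdown, and this report also saw crashes in that configuration. The `--parallel` (`-np`) parameter reserves a separate KV cache region for each parallel sequence, enabling true parallel decoding. Critical detail: in most builds the context window `-c` is the total shared by all slots, not a per-slot value. Each slot gets roughly `-c` divided by the slot count, so `-c 8192 --parallel 4` allocates about the same KV cache as `-c 8192 --parallel 1` but leaves each slot only 2048 tokens. Users often miss this and set `-c 2048` with 8 slots, which gives 256 tokens per slot and truncated contexts. To give each of 4 slots a full 8192 tokens, set `-c 32768`, which needs roughly 4x the KV cache memory. Newer builds may change how the cache is shared between slots, so check `llama-server --help`. Also, continuous batching (`--cont-batching`) lets new requests join the running batch as slots free up, so a short request does not wait for the slowest one, at the cost of more involved KV cache management inside the server.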
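The context arithmetic can be checked with a few lines. This sketch assumes the server splits `-c` evenly across slots, which matches the truncation seen with `-c 2048` and 8 slots; `ctx_per_slot` and `fits` are hypothetical helpers, not part of llama.cpp.

```python
def ctx_per_slot(total_ctx, slots):
    """Tokens available to each slot, assuming an even split of -c."""
    if slots < 1:
        raise ValueError("slots must be positive")
    return total_ctx // slots


def fits(prompt_tokens, max_new_tokens, total_ctx, slots):
    """True if prompt plus generation fits inside one slot's share."""
    return prompt_tokens + max_new_tokens <= ctx_per_slot(total_ctx, slots)


# Compare the configurations discussed in this report.
for total, slots in [(8192, 1), (8192, 4), (2048, 8), (32768, 4)]:
    share = ctx_per_slot(total, slots)
    print(f"-c {total} with {slots} slot(s): {share} tokens per slot")

# A 1500-token prompt with 256 new tokens against two 8-slot setups.
print(fits(1500, 256, total_ctx=2048, slots=8))
print(fits(1500, 256, total_ctx=16384, slots=8))
```

Running a check like this against typical prompt sizes before launch shows whether a given `-c` and slot count will truncate requests.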
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T10:19:25.189130+00:00— report_created — created