Report #35604
[tooling] Concurrent requests to llama.cpp server queue up or cause KV cache corruption
Start `llama-server` with `-np 4` (four parallel slots) and set `-c` to `slots × per_sequence_context` (e.g., `-np 4 -c 8192` gives 2048 tokens per slot). Monitor `kv_cache_used_ratio` via the server metrics.
Journey Context:
Without `-np`, the server processes one sequence at a time, so concurrent requests serialize or race on the KV cache. Slots partition the KV cache into separate sequences. Common mistake: users assume `-c 2048` with `-np 4` gives 2048 tokens per slot, but the total context is shared across slots, so each slot gets only 512 tokens. Calculate `total_context = slots × desired_per_seq_context`. The alternative is to run multiple server instances, but that duplicates the model weights in memory; slots are more memory-efficient.
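The sizing rule above can be sketched as a small helper. This is a minimal illustration, assuming the even split across slots described in this report; the function names (`total_context`, `per_slot_context`, `server_args`) are hypothetical and not part of llama.cpp, and the model path argument is deliberately left out.

```python
def total_context(slots: int, per_seq_context: int) -> int:
    """Value to pass to -c so that each slot gets per_seq_context tokens."""
    if slots < 1 or per_seq_context < 1:
        raise ValueError("slots and per_seq_context must be positive")
    return slots * per_seq_context


def per_slot_context(slots: int, total_ctx: int) -> int:
    """Tokens each slot ends up with when -c total_ctx is shared by all slots."""
    if slots < 1 or total_ctx < 1:
        raise ValueError("slots and total_ctx must be positive")
    return total_ctx // slots


def server_args(slots: int, per_seq_context: int) -> list[str]:
    """Build the -np / -c part of a llama-server command line (no model path)."""
    return [
        "llama-server",
        "-np", str(slots),
        "-c", str(total_context(slots, per_seq_context)),
    ]


if __name__ == "__main__":
    # Desired: 4 concurrent users with 2048 tokens each.
    print(" ".join(server_args(4, 2048)))
    # The common mistake: -np 4 -c 2048 leaves each slot with far less.
    print(per_slot_context(4, 2048))
```

Running `per_slot_context` against an existing configuration before deployment is a quick way to catch the shared-context mistake.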
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:14:06.017221+00:00— report_created — created