Report #15571
[tooling] llama.cpp server throughput collapses under concurrent requests due to sequential processing or multiple model loads
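Launch a single server with `-np 4 -cb` (4 parallel slots, continuous batching enabled) and send each concurrent request with a distinct `id_slot` value. The GPU then processes all active slots in one batch, so aggregate throughput scales close to linearly with the number of busy slots, at least until the GPU becomes compute-bound.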
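A minimal sketch of the launch line, built in Python so it can be dropped into a supervisor script. The binary name `llama-server`, the model path, the port, and the helper `build_server_command` are illustrative assumptions, not taken from this report. It also assumes the total context passed with `-c` is shared across slots, which is how many builds behave but should be checked against the build in use.

```python
import shlex


def build_server_command(model_path: str, n_slots: int = 4,
                         ctx_per_slot: int = 4096, port: int = 8080) -> list:
    """Return an argv list for ONE server process with n_slots parallel slots.

    Hypothetical helper: only the flags themselves (-m, -np, -cb, -c, --port)
    come from llama.cpp; everything else is illustrative.
    """
    if n_slots < 1:
        raise ValueError("n_slots must be >= 1")
    return [
        "llama-server",                      # assumed binary name; older builds ship ./server
        "-m", model_path,                    # the model is loaded once and shared by all slots
        "-np", str(n_slots),                 # number of parallel slots
        "-cb",                               # continuous batching (often already the default)
        "-c", str(ctx_per_slot * n_slots),   # assumption: total context is divided across slots
        "--port", str(port),
    ]


cmd = build_server_command("models/model.gguf")
print(shlex.join(cmd))
```

Scaling `-c` with the slot count keeps the per-slot context from shrinking when more slots are added, under the assumption stated above.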
Journey Context:
Many users launch 4 separate llama.cpp processes on different ports to handle 4 concurrent users. Each process then loads its own copy of the model (roughly 4x the VRAM), and the processes compete for GPU time. Others do use `-np` but miss that slots are stateful: they send every request to slot 0, so the requests are processed one after another. The `-np` flag creates independent KV cache slots inside one shared model context, so the weights are loaded only once. With continuous batching (`-cb`, often the default in recent builds), the server feeds tokens from all active slots to the GPU in a single batched forward pass per step instead of one pass per request, which keeps the GPU far better utilized. This is essential for API servers: a distinct `id_slot` per user session allows concurrent prefills and generations without head-of-line blocking.
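The slot assignment described above can be sketched as a small client. This is a sketch under assumptions, not a reference client: the base URL, the helper names (`slot_for_session`, `build_payload`, `complete`, `run_concurrently`), and the round-robin mapping from session to slot are hypothetical. The `/completion` endpoint and the `prompt`, `n_predict`, `id_slot`, and `cache_prompt` fields are the ones the llama.cpp server documents. As far as I know, omitting `id_slot` (default `-1`) lets the server pick an idle slot itself, so pinning is mainly useful for keeping one session on one slot's cache.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

N_SLOTS = 4                          # must match the -np value the server was started with
BASE_URL = "http://127.0.0.1:8080"   # assumed host and port


def slot_for_session(session_index: int, n_slots: int = N_SLOTS) -> int:
    """Map a session to a slot (hypothetical round-robin policy)."""
    return session_index % n_slots


def build_payload(prompt: str, slot: int, n_predict: int = 64) -> dict:
    """Request body for POST /completion, pinned to one slot."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        "id_slot": slot,        # distinct per session, so requests do not queue on slot 0
        "cache_prompt": True,   # reuse the slot's KV cache for a shared prompt prefix
    }


def complete(payload: dict, base_url: str = BASE_URL, timeout: float = 120.0) -> str:
    """Send one completion request and return the generated text."""
    req = urllib.request.Request(
        base_url + "/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["content"]


def run_concurrently(prompts: list, base_url: str = BASE_URL) -> list:
    """Issue all prompts at once, one slot per prompt, so the server can batch them."""
    payloads = [build_payload(p, slot_for_session(i)) for i, p in enumerate(prompts)]
    with ThreadPoolExecutor(max_workers=N_SLOTS) as pool:
        return list(pool.map(lambda pl: complete(pl, base_url), payloads))
```

With a server running, `run_concurrently(["prompt A", "prompt B", "prompt C", "prompt D"])` sends four requests to slots 0 through 3 in parallel. Sending the same four payloads with `"id_slot": 0` reproduces the sequential behavior described in this report.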
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T00:25:21.550773+00:00— report_created — created