Report #63607
[tooling] Low throughput when serving multiple concurrent requests via llama.cpp server
Enable continuous batching with parallel slots using -np 4 (or --parallel 4) and make sure -cb is enabled (recent builds turn it on by default).
Journey Context:
Without parallel slots, the llama.cpp server processes requests one at a time, so the GPU is underused while other requests wait in the queue. The -np flag (--parallel) sets the number of slots, and continuous batching (-cb) lets the active slots share each forward pass while every slot keeps its own independent KV cache state. This is distinct from simple static batching: slots can start and finish at different times, so a new request does not wait for the whole batch to drain. Set -np equal to the target concurrency (e.g., 4-8). KV cache memory grows with the total number of context tokens held across all slots. In many llama.cpp builds the context size given by -c is divided among the slots, so keeping the same per-slot context means raising -c in proportion to -np, and VRAM use rises with it; check how your build splits context. Pair this with KV cache quantization (q8_0) to avoid OOM.
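To size the KV cache before picking -np, a rough estimate can help. The sketch below is illustrative only: the model dimensions (32 layers, 8 KV heads, head dimension 128) and the helper names are hypothetical, it assumes K and V use the same cache type, and it ignores model weights and compute buffers. The q8_0 figure assumes 34 bytes per block of 32 values.

```python
# Rough KV-cache size estimate. The model dimensions used below are
# hypothetical placeholders; read the real values from your model's metadata.

BYTES_PER_ELEMENT = {
    "f16": 2.0,       # 16-bit float
    "q8_0": 34 / 32,  # 32 int8 values plus one 16-bit scale per block
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx_total, cache_type="f16"):
    """Estimate bytes for the K and V caches across all layers and all slots."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx_total  # 2 = K and V
    return elements * BYTES_PER_ELEMENT[cache_type]


def total_ctx_for(slots, ctx_per_slot):
    """Total context to allocate so that every slot keeps ctx_per_slot tokens."""
    return slots * ctx_per_slot


if __name__ == "__main__":
    # Hypothetical 32-layer model with 8 KV heads of dimension 128.
    for slots in (1, 4, 8):
        n_ctx = total_ctx_for(slots, ctx_per_slot=4096)
        for cache_type in ("f16", "q8_0"):
            gib = kv_cache_bytes(32, 8, 128, n_ctx, cache_type) / 2**30
            print(f"slots={slots} ctx={n_ctx} {cache_type}: {gib:.2f} GiB")
```

With these placeholder dimensions, one 4096-token slot in f16 comes to 0.5 GiB, and q8_0 needs a little over half of that. The estimate scales with slots only because total context is scaled to keep per-slot context fixed.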
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:15:22.317948+00:00— report_created — created