Report #57856
[tooling] llama.cpp server handles multiple requests slowly (sequential processing) or duplicates KV cache for each slot
Launch the llama.cpp server with -np 4 (four parallel slots) and -cb (continuous batching). Requests from different users can then be decoded concurrently: all slots share a single copy of the model weights, and each slot keeps its own KV cache, so the full model is never duplicated in memory.
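As a rough illustration of the launch described above, the following Python sketch assembles the command line. The -m, -c, -np, -cb and --port flags are existing llama.cpp server options; the binary name (llama-server here, ./server in older builds), the model path, the context size and the port are assumptions chosen for the example. The per_slot_context helper further assumes a build that divides the total context (-c) evenly among slots, which is worth confirming against your version's --help output.

```python
import shlex


def build_server_command(model_path, parallel_slots=4, ctx_size=8192,
                         port=8080, binary="llama-server"):
    """Return the argv list for a llama.cpp server with parallel slots.

    Assumptions (not from the report): binary name, context size, port.
    """
    if parallel_slots < 1:
        raise ValueError("parallel_slots must be at least 1")
    return [
        binary,
        "-m", model_path,            # path to the GGUF model (hypothetical)
        "-c", str(ctx_size),         # total context size in tokens
        "-np", str(parallel_slots),  # number of parallel slots
        "-cb",                       # continuous batching
        "--port", str(port),
    ]


def per_slot_context(ctx_size, parallel_slots):
    """Tokens per slot, ASSUMING the build splits -c evenly across slots."""
    return ctx_size // parallel_slots


if __name__ == "__main__":
    cmd = build_server_command("models/model.gguf")
    print(shlex.join(cmd))            # shell-ready command line
    print(per_slot_context(8192, 4))  # context budget per slot
```

If the even split applies to your build, raising -np without raising -c shrinks the context available to each request, so the two values are best chosen together.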
Journey Context:
With a single slot (-np 1) or without continuous batching, the llama.cpp server handles requests one after another or with simple slot management, and throughput under load is poor. Many users run with one slot or do not know that -cb exists; newer builds may enable continuous batching by default, so check the server's --help output for your version. Continuous batching decodes the pending sequences together in the same forward pass where possible, which raises aggregate throughput when several requests are in flight. The -np parameter creates independent KV cache slots; -cb ensures the scheduler batches them efficiently. Without -cb, slots are processed round-robin and the benefits of batching are lost. Setting both matters when a local LLM is deployed as an API server.
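The effect of batching several slots into one forward pass can be sketched with a toy model. This is not llama.cpp's scheduler: it is a simplified simulation, with hypothetical function names, that only counts decode steps, ignores prompt processing, and treats every forward pass as equally expensive regardless of how many sequences it carries.

```python
from collections import deque


def sequential_steps(lengths):
    """Forward passes when each request is decoded alone, one after another."""
    return sum(lengths)


def continuous_batching_steps(lengths, slots):
    """Forward passes when up to `slots` sequences share each decode step.

    A waiting request takes over a slot as soon as one frees up.
    Each entry in `lengths` is the number of tokens to generate (>= 1).
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    waiting = deque(lengths)
    active = []  # tokens still to generate for each busy slot
    steps = 0
    while waiting or active:
        # Fill free slots from the queue before the next forward pass.
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        # One forward pass produces one token for every active sequence.
        steps += 1
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


if __name__ == "__main__":
    lengths = [5, 3, 8, 2]  # tokens requested by four clients (made-up values)
    print(sequential_steps(lengths))              # 18
    print(continuous_batching_steps(lengths, 4))  # 8
```

In this toy run four requests finish in 8 shared steps instead of 18 separate ones. A real batched pass costs somewhat more than a single-sequence pass, so the actual gain is smaller than the step count suggests and should be measured on your own hardware.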
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:36:14.033793+00:00— report_created — created