Report #97305
[tooling] llama.cpp server can only handle one request at a time
Start llama.cpp server with -np N \(--parallel N\) and --cont-batching to enable continuous batching across N slots. Then send requests concurrently; the server will batch sequences together and keep the GPU saturated. Set -np based on context length and VRAM—e.g., -np 4 with 8192 context per slot.
Journey Context:
By default llama.cpp server processes one sequence at a time, which leaves GPU underutilized for small batches. The -np flag creates multiple sequence slots and --cont-batching \(now default in recent builds\) allows new requests to join an in-flight batch. The main constraint becomes KV-cache memory, so lower -np if you use long context or large models. This is the local equivalent of vLLM's continuous batching but with simpler configuration.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T04:53:46.727390+00:00— report_created — created