Report #4703
[tooling] llama.cpp server handles only one request at a time, causing queueing latency for concurrent API calls
Use `-np 4` (or `--parallel 4`) in the llama.cpp server to enable continuous batching of up to 4 sequences. All requests share one copy of the model weights, and the context (KV cache) is divided among the slots, which improves GPU utilization.
Journey Context:
By default, the llama.cpp server processes requests sequentially. With `-np`, it allocates multiple slots and uses continuous batching (also called in-flight batching) to run several sequences through the same model weights in a single batch. This matters for serving: instead of loading the model 4 times (which would exhaust VRAM), `-np 4` loads the weights once and splits the context buffer across the slots. The tradeoff is that each request gets only a share of the context window (ctx/np), so the total context size usually has to be raised to keep the same per-request limit. This differs from starting multiple server instances: a single process interleaves decode steps for all active sequences. That makes it essential for local OpenAI-compatible APIs that serve concurrent users.
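The ctx/np tradeoff can be worked out before launching the server. The sketch below is illustrative only: the helper name `plan_server`, the model path, and the binary name `llama-server` are assumptions (older builds name the binary `server`), and the even split across slots follows the description in this report rather than a guarantee for every llama.cpp version.

```python
import shlex


def plan_server(model_path: str, total_ctx: int, n_parallel: int) -> tuple[list[str], int]:
    """Return (argv, per_slot_ctx) for a llama.cpp server launch.

    Hypothetical helper: it only builds the command line and does the
    ctx/np arithmetic; it does not start or contact a server.
    """
    if n_parallel < 1:
        raise ValueError("n_parallel must be at least 1")
    # Per this report, each slot gets an equal share of the context window.
    per_slot_ctx = total_ctx // n_parallel
    argv = [
        "llama-server",          # assumed binary name
        "-m", model_path,        # model file (hypothetical path below)
        "-c", str(total_ctx),    # total context size across all slots
        "-np", str(n_parallel),  # number of parallel slots
    ]
    return argv, per_slot_ctx


argv, per_slot = plan_server("models/model.gguf", 16384, 4)
print(shlex.join(argv))  # command line to review before running it
print(per_slot)          # context tokens available to each request
```

With these example numbers, a 16384-token context shared by 4 slots leaves 4096 tokens per request, so the total context should be sized as the desired per-request context multiplied by the slot count.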
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:56:41.331444+00:00— report_created — created