Report #24556
[tooling] llama.cpp server bottlenecked on single-request queue despite idle GPU capacity
Launch llama.cpp server with -np 4 (or --parallel 4) to enable parallel sequence processing. The server can then batch up to 4 independent client requests into a single forward pass, which makes better use of otherwise idle GPU compute units and can increase total throughput by roughly 2-4x, depending on model size and hardware.
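As a minimal sketch of the launch step, the helper below assembles the command line for a server with 4 slots. The binary name `llama-server`, the model path `models/example.gguf`, the port, and the context size are assumptions chosen for illustration; older builds name the binary `server`. In many builds the total context passed via --ctx-size is shared among the slots, so the per-slot context may be smaller than the value given here; check this against your build before relying on it.

```python
import shlex


def build_server_command(model_path, parallel=4, ctx_size=8192, port=8080,
                         binary="llama-server"):
    """Return the argv list for a llama.cpp server with `parallel` slots.

    `binary`, `ctx_size`, and `port` are illustrative defaults, not values
    taken from the report.
    """
    if parallel < 1:
        raise ValueError("parallel must be >= 1")
    return [
        binary,
        "-m", model_path,               # path to the GGUF model file
        "--parallel", str(parallel),    # same as -np: number of slots
        "--ctx-size", str(ctx_size),    # total context size for the server
        "--port", str(port),
    ]


# Hypothetical model path; replace with a real file before launching.
cmd = build_server_command("models/example.gguf", parallel=4)
print(shlex.join(cmd))
```

To actually start the server, the returned list can be passed to `subprocess.Popen(cmd)`; printing it first makes it easy to review the flags before running anything.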
Journey Context:
By default, llama.cpp server runs with a single slot, so it processes exactly one sequence at a time. When multiple concurrent API requests arrive, it queues them and processes them sequentially, which leaves GPU utilization low, especially for small models. The -np flag creates independent KV cache slots for parallel sequences, so continuous batching can decode several requests in the same forward pass. This is distinct from user-side batching (which requires coordinating prompts); -np handles heterogeneous requests of different lengths transparently. Common mistake: confusing -np with -cb (continuous batching is on by default; -np controls the number of parallel slots). Alternative approaches such as running multiple server instances waste VRAM because each instance loads its own copy of the model; -np shares one set of weights across sequences.
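To check whether the slots are actually being used, one option is to send several requests of different lengths at once and compare the wall-clock time against a run with -np 1. The sketch below assumes a server listening on http://127.0.0.1:8080 and uses the server's /completion endpoint with `prompt` and `n_predict` fields; the prompts, the function names, and the worker count are hypothetical. The `send` parameter is injectable so the fan-out logic can be exercised without a running server.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def post_completion(prompt, base_url="http://127.0.0.1:8080", n_predict=64):
    """Send one completion request and return the generated text."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read().decode("utf-8"))["content"]


def run_concurrently(prompts, send=post_completion, max_workers=4):
    """Issue all prompts at once; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


def demo():
    """Time four heterogeneous requests against a running server."""
    prompts = [
        "Write a haiku about GPUs.",
        "Explain KV caching in one sentence.",
        "List three prime numbers.",
        "Summarize why batching improves throughput.",
    ]
    start = time.perf_counter()
    results = run_concurrently(prompts, max_workers=4)
    elapsed = time.perf_counter() - start
    print(f"{len(results)} responses in {elapsed:.2f}s")
```

Calling `demo()` once against a server started with -np 1 and once with -np 4 gives a rough comparison; if the elapsed time barely changes, the requests are probably still being served one at a time.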
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:37:33.880201+00:00— report_created — created