Report #39540
[tooling] Single llama.cpp instance only uses 30% GPU while serving multiple concurrent requests sequentially
Start the server with -np 4 (or higher) to enable parallel sequence processing with continuous batching, so a single model instance can process four or more sequences simultaneously, each in its own KV cache slot.
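A minimal sketch of a launch command for this fix, written as a small Python helper so the context arithmetic is explicit. The helper name build_server_cmd, the model path, and the default layer count are hypothetical. -m, -c, -np and -ngl are llama-server flags, but check llama-server --help for your build. The sketch assumes -c is the total context that the server divides across slots, which may not hold for every version.

```python
import shlex


def build_server_cmd(model_path, per_slot_ctx, n_parallel, gpu_layers=99):
    """Return an argv list that starts llama-server with n_parallel slots.

    Assumption: -c is the TOTAL context, split evenly across the -np slots,
    so it is scaled up here to keep per_slot_ctx tokens for each sequence.
    """
    if per_slot_ctx < 1 or n_parallel < 1:
        raise ValueError("per_slot_ctx and n_parallel must be positive")
    total_ctx = per_slot_ctx * n_parallel
    return [
        "llama-server",
        "-m", model_path,          # hypothetical path to a GGUF model
        "-c", str(total_ctx),      # total context across all slots
        "-np", str(n_parallel),    # number of parallel sequences (slots)
        "-ngl", str(gpu_layers),   # layers to offload to the GPU
    ]


cmd = build_server_cmd("models/model.gguf", per_slot_ctx=4096, n_parallel=4)
print(shlex.join(cmd))
```

If your build gives each slot the full -c value instead, pass the per-slot context directly and drop the multiplication.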
Journey Context:
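By default llama.cpp processes one sequence at a time per batch. Token generation for a single sequence is limited by memory bandwidth, so most GPU compute sits idle. The -np (number of parallel sequences) flag allocates a separate KV cache slot for each sequence, and continuous batching decodes the active slots together so the idle compute units are put to work. This raises throughput for multi-user local APIs without loading multiple model copies (which would multiply weight memory and likely run out of VRAM). Tradeoff: KV cache VRAM grows linearly with the slot count when per-slot context is held constant (per-slot KV cache size × np). For a 70B model at 4K context, a single-sequence fp16 KV cache is ~10GB if the model uses full multi-head attention; with -np 4, that is ~40GB for caches alone. Models with grouped-query attention need less: Llama 2 70B, with 8 KV heads instead of 64, needs about 1.25GB per 4K slot, or ~5GB at -np 4.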
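The KV cache figures can be checked with a rough estimate. This is a sketch that assumes an fp16 cache and a transformer with 80 layers and head dimension 128 (the Llama 2 70B shape); the function name kv_cache_bytes is hypothetical, and real usage adds allocator and compute-buffer overhead that this ignores. The ~10GB figure corresponds to 64 KV heads (full multi-head attention); with 8 KV heads (grouped-query attention) the same context needs one eighth of that.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimate KV cache size for one sequence of n_ctx tokens.

    The leading 2 accounts for one K and one V tensor per layer;
    bytes_per_elem=2 assumes an fp16 cache (an assumption, not a default
    read from llama.cpp).
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Assumed 70B shape: 80 layers, head_dim 128, 4K context per slot.
per_slot_mha = kv_cache_bytes(80, 64, 128, 4096)  # full multi-head attention
per_slot_gqa = kv_cache_bytes(80, 8, 128, 4096)   # grouped-query, 8 KV heads

for n_parallel in (1, 4):
    print(
        f"-np {n_parallel}: "
        f"MHA {n_parallel * per_slot_mha / GIB:.2f} GiB, "
        f"GQA {n_parallel * per_slot_gqa / GIB:.2f} GiB"
    )
```

Substituting the layer count, KV head count and head dimension from your model's metadata gives a quick check of whether a given -np fits in VRAM alongside the weights.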
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:50:32.760238+00:00— report_created — created