Report #67642
[tooling] Low throughput serving multiple concurrent requests to local LLM server
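Start `llama-server` with `-np 4 --cont-batching` (or a higher `-np` if VRAM allows) to serve requests from parallel slots with continuous batching. The GPU then processes 4 or more independent sequences in the same forward pass instead of queueing them one after another. Recent llama.cpp builds enable continuous batching by default, so `--cont-batching` may be redundant there; `-np` is the setting that matters.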
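As a rough sketch of the launch command described above, the snippet below assembles the `llama-server` arguments in Python. The helper name `build_server_cmd`, the path `model.gguf`, and the per-slot context of 4096 tokens are illustrative assumptions, not part of the report. It also assumes the context given by `-c` is shared across slots, which holds for many llama.cpp builds but should be checked against the version in use.

```python
import shlex


def build_server_cmd(model_path, n_parallel=4, ctx_per_slot=4096):
    """Return an argv list for llama-server with parallel slots (hypothetical helper)."""
    if n_parallel < 1 or ctx_per_slot < 1:
        raise ValueError("n_parallel and ctx_per_slot must be positive")
    # Assumption: -c is the total context and is divided among the slots,
    # so it is scaled here to keep ctx_per_slot tokens available per request.
    total_ctx = n_parallel * ctx_per_slot
    return [
        "llama-server",
        "-m", model_path,
        "-np", str(n_parallel),
        "-c", str(total_ctx),
        "--cont-batching",
    ]


cmd = build_server_cmd("model.gguf")  # "model.gguf" is a placeholder path
print(shlex.join(cmd))
# prints: llama-server -m model.gguf -np 4 -c 16384 --cont-batching
```

Passing the list to `subprocess.run(cmd)` would start the server; printing it first makes it easy to check the flags before running anything.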
Journey Context:
Without `-np`, the server processes one sequence at a time: other requests wait in the queue, and much of the GPU's compute sits idle while a single sequence is decoded. Continuous batching (cont-batching) dynamically packs tokens from multiple sequences into the same batch, keeping the matrix units saturated. Common error: setting `-np` too high without accounting for KV cache overhead. Each parallel slot consumes `2 * n_layers * n_kv_heads * head_dim * seq_len * sizeof(dtype)` bytes, where the factor 2 covers the key and value tensors. For a 70B model on 48 GB, even `-np 2` might already run out of memory. Tradeoff: latency vs throughput. A higher `-np` slightly increases the TTFT (time to first token) of each individual request but substantially improves aggregate throughput (total tokens/sec across all requests).
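To make the KV cache formula above concrete, here is a minimal sketch that evaluates it and estimates how many slots fit in a given memory budget. The function names are hypothetical, and the model dimensions (80 layers, 8 KV heads, head dimension 128, fp16 cache) are illustrative values typical of a 70B Llama-style model with grouped-query attention; the 6 GiB budget is likewise an assumption standing in for whatever VRAM remains after the weights are loaded.

```python
def kv_bytes_per_slot(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """KV cache size for one slot: keys and values (factor 2) for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes


def max_slots(free_bytes, bytes_per_slot):
    """Largest -np whose KV cache fits in the remaining memory budget."""
    return free_bytes // bytes_per_slot


GIB = 1024 ** 3

# Illustrative 70B-class dimensions (assumed, not taken from the report).
per_slot = kv_bytes_per_slot(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)
print(per_slot / GIB)  # 1.25 GiB per slot at 4096 tokens

# Assumed budget: 6 GiB left over once the model weights are on the GPU.
print(max_slots(6 * GIB, per_slot))  # 4
```

The estimate ignores compute buffers and allocator overhead, so the real limit is likely somewhat lower; it is meant as a sanity check before raising `-np`, not as a guarantee.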
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:01:17.170225+00:00— report_created — created