Report #51309
[tooling] llama.cpp server has low throughput with concurrent requests
Start the server with -np N (parallel slots), where N is the expected number of concurrent users, and enable -cb (continuous batching) so that sequences of different lengths are processed in the same batch, reducing padding overhead.
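A minimal sketch of how the launch command could be assembled. The binary name `llama-server`, the model path `model.gguf`, and the helper `build_server_command` are illustrative assumptions, not part of the report. The sketch also assumes a build where the total context (-c) is shared across slots, so it scales -c by the slot count; confirm the flags against `--help` for your build before running.

```python
import shlex


def build_server_command(model_path, n_parallel, ctx_per_slot, binary="llama-server"):
    """Build an argv list for a llama.cpp server with parallel slots.

    Hypothetical helper: assumes -c is the total context shared by all
    slots, so it is set to n_parallel * ctx_per_slot.
    """
    if n_parallel < 1 or ctx_per_slot < 1:
        raise ValueError("n_parallel and ctx_per_slot must be positive")
    total_ctx = n_parallel * ctx_per_slot
    return [
        binary,
        "-m", model_path,        # model file (placeholder path)
        "-c", str(total_ctx),    # total context across all slots (assumption)
        "-np", str(n_parallel),  # number of parallel slots
        "-cb",                   # continuous batching
    ]


cmd = build_server_command("model.gguf", n_parallel=4, ctx_per_slot=4096)
print(shlex.join(cmd))
# prints: llama-server -m model.gguf -c 16384 -np 4 -cb
```

Returning an argv list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues.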
Journey Context:
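With the default configuration, the server handles one sequence at a time, or batches requests without continuous batching, so the GPU is underutilized when several chat sessions are active at once. -np reserves a separate KV cache slot for each concurrent sequence. -cb lets the server batch tokens that sit at different positions in their respective sequences, which keeps the GPU busy even when requests start and finish at different times. Tradeoff: VRAM usage grows with the number of slots, because every slot needs its own KV cache. In builds where the total context (-c) is divided evenly across slots, -c must also be raised to roughly N times the desired per-slot context, otherwise each session's context window shrinks.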
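To make the VRAM tradeoff concrete, the following rough estimate uses the standard KV cache size formula (keys plus values, per layer, per KV head, per token). The model shape used here (32 layers, 32 KV heads, head dimension 128, 16-bit cache) is a hypothetical example, not something stated in the report. The estimate ignores compute buffers, model weights, and quantized KV cache types.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Approximate KV cache size in bytes for n_ctx tokens.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes a 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical model shape; substitute the values for your own model.
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128)

one_slot = kv_cache_bytes(n_ctx=4096, **shape)
four_slots = kv_cache_bytes(n_ctx=4 * 4096, **shape)

print(f"1 slot  x 4096 tokens: {one_slot / GIB:.1f} GiB")
print(f"4 slots x 4096 tokens: {four_slots / GIB:.1f} GiB")
# prints:
# 1 slot  x 4096 tokens: 2.0 GiB
# 4 slots x 4096 tokens: 8.0 GiB
```

Under these assumptions the KV cache scales linearly with the slot count, so the number of slots that fit is bounded by the VRAM left over after the model weights are loaded.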
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:36:41.122051+00:00— report_created — created