Report #79482
[tooling] llama.cpp server handles concurrent requests slowly, processing them sequentially instead of leveraging batching
Start the server with `-np 4 -cb` (or `-np N`, where N matches your expected concurrency) to enable continuous batching, which lets the GPU process tokens from multiple independent requests within the same batch. Adding `--slots` is optional: it exposes the slot monitoring endpoint for inspecting slot state and does not affect batching.
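A minimal sketch of the launch command, written as a Python helper so the flags can be assembled and checked in a script. The binary name `llama-server`, the model path `models/model.gguf`, and the helper name `build_server_command` are assumptions for illustration (older builds name the binary `server`); only `-np` and `-cb` come from the workaround above.

```python
import shlex


def build_server_command(model_path, parallel=4, binary="llama-server"):
    """Return the argv list for a server with `parallel` slots and continuous batching."""
    if parallel < 1:
        raise ValueError("parallel must be at least 1")
    return [
        binary,                # assumed binary name; may differ in your build
        "-m", model_path,      # hypothetical model path supplied by the caller
        "-np", str(parallel),  # number of parallel slots
        "-cb",                 # continuous batching
    ]


command = build_server_command("models/model.gguf", parallel=4)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.Popen(command)` to start the server from a script, or the printed line can be pasted into a shell.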
Journey Context:
By default, the llama.cpp server processes requests one at a time (or without effective batching), which leaves the GPU underutilized when serving multiple clients. The `-np` (parallel) flag creates multiple decoding slots, but without `-cb` (continuous batching) the batch is filled once at the start and is not refilled until every sequence in it completes. With continuous batching, the server adds new sequences to the batch as others finish generating, which keeps the GPU busy. This matters most for high-throughput APIs. Some newer builds enable continuous batching by default, so check the `--help` output for your version. Additionally, using a shared system prompt across slots allows the KV cache for that common prefix to be shared, which further improves efficiency.
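One way to check whether requests are actually being served in parallel is to send several at once and time the whole run. The sketch below assumes the server listens on `http://127.0.0.1:8080` and exposes the `/completion` endpoint with `prompt` and `n_predict` request fields and a `content` response field; the address and the helper names are assumptions, so adjust them for your setup.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SERVER_URL = "http://127.0.0.1:8080/completion"  # assumed address; adjust as needed


def build_payload(prompt, n_predict=64):
    """Build the JSON body for one completion request."""
    return {"prompt": prompt, "n_predict": n_predict}


def post_completion(prompt, url=SERVER_URL, timeout=120):
    """Send one prompt to the server and return the generated text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["content"]


def run_concurrently(prompts, send=post_completion, workers=4):
    """Send all prompts from a thread pool; return (results, elapsed_seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results[i] belongs to prompts[i].
        results = list(pool.map(send, prompts))
    return results, time.perf_counter() - start
```

Calling `run_concurrently(["Write a haiku about GPUs."] * 4)` against a server started with `-np 1` and again with `-np 4 -cb` gives a rough comparison: if batching is working, the second run should take noticeably less than four times a single request. The `send` parameter exists only so the helper can be exercised without a live server.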
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:00:30.992561+00:00— report_created — created