Report #50547
[tooling] llama.cpp server has low throughput with concurrent clients despite high GPU utilization
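Compile with `-DLLAMA_CUDA_GRAPHS=ON` and launch `llama-server` with `--parallel 4 --cont-batching` (`-cb` is the short form) to enable continuous batching, which lets up to 4 sequences share the same forward pass. llama.cpp build option names have changed between releases (for example, `LLAMA_CUDA` became `GGML_CUDA`), so verify the CUDA Graphs option against the CMake options of your checkout.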
Journey Context:
Users run several clients against `llama-server` at once and see requests queue and run serially, so TTFT (time to first token) grows linearly with queue depth. By default, llama.cpp processes one sequence per batch. Continuous batching (`-cb`) lets the server insert new sequences into a running batch at every generation step: while sequence A is generating token N, sequence B can start prefill and join the same batch for token N+1. The `--parallel` flag allocates KV cache slots for multiple sequences. Without CUDA Graphs, the CPU overhead of launching many small kernels dominates at small batch sizes; with graphs enabled, the decode kernels are captured once and replayed as a single graph launch, reducing CPU launch overhead by about 90%. This configuration typically yields a 3-4x throughput improvement over serial processing, at the cost of higher VRAM usage (KV cache per parallel slot).
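The difference between serial queuing and continuous batching can be illustrated with a toy scheduler model. This is a minimal sketch, not llama.cpp's actual scheduler: `simulate` and its parameters are hypothetical names, all requests are assumed to arrive at once, prefill is folded into the first step, and every forward pass is assumed to cost the same regardless of batch size, which real hardware does not guarantee.

```python
from collections import deque


def simulate(n_requests: int, tokens_per_request: int, slots: int) -> tuple[list[int], int]:
    """Toy model: each forward pass ("step") emits one token per active slot.

    All requests arrive at step 0. Returns the step at which each request
    emitted its first token, plus the total number of steps until all
    requests have finished.
    """
    waiting = deque(range(n_requests))
    remaining: dict[int, int] = {}      # request id -> tokens still to generate
    first_token_step = [0] * n_requests
    step = 0
    while waiting or remaining:
        # Continuous batching: fill any free slot before every step.
        while waiting and len(remaining) < slots:
            remaining[waiting.popleft()] = tokens_per_request
        step += 1
        for req in list(remaining):     # copy keys so finished requests can be removed
            if remaining[req] == tokens_per_request:
                first_token_step[req] = step
            remaining[req] -= 1
            if remaining[req] == 0:
                del remaining[req]
    return first_token_step, step


# One slot behaves like serial processing: time to first token grows with queue depth.
serial_ttft, serial_steps = simulate(n_requests=4, tokens_per_request=10, slots=1)
# Four slots share every forward pass: all requests start immediately.
batched_ttft, batched_steps = simulate(n_requests=4, tokens_per_request=10, slots=4)

print(serial_ttft, serial_steps)    # [1, 11, 21, 31] 40
print(batched_ttft, batched_steps)  # [1, 1, 1, 1] 10
```

Under these idealized assumptions, four slots finish the same work in a quarter of the steps. On real hardware a batched step costs somewhat more than a single-sequence step, which is consistent with the 3-4x range reported above rather than a flat 4x.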
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:19:41.421445+00:00— report_created — created