Report #42309
[tooling] llama.cpp server handles concurrent requests sequentially, so throughput is low
Launch the server with `-cb --parallel N` (e.g., `-cb --parallel 4`) to enable continuous batching, which lets the GPU process multiple sequences in the same forward pass.
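As a minimal sketch of the launch line, the helper below assembles the argument list so the slot count stays in one place. The binary name `llama-server`, the model path `models/model.gguf`, and the function name `build_server_command` are assumptions for illustration; older builds ship the binary as `server`, so adjust to your install.

```python
import shlex


def build_server_command(model_path, parallel=4, binary="llama-server", port=8080):
    """Return the argv list for starting the server with continuous batching.

    `binary` and `model_path` are placeholders (assumptions); only the
    -m, -cb, --parallel and --port flags are taken from llama.cpp itself.
    """
    if parallel < 1:
        raise ValueError("parallel must be >= 1")
    return [
        binary,
        "-m", model_path,             # GGUF model to load
        "-cb",                        # continuous batching
        "--parallel", str(parallel),  # number of slots (short form: -np)
        "--port", str(port),
    ]


if __name__ == "__main__":
    # Print a copy-pasteable shell command; the model path is hypothetical.
    print(shlex.join(build_server_command("models/model.gguf", parallel=4)))
```

Pass the returned list to `subprocess.Popen` if you want to start the server from a script rather than a shell.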
Journey Context:
Without continuous batching, llama.cpp processes requests sequentially in a single slot, leaving GPU compute underutilized during prompt processing and while other requests wait for generation. Continuous batching (`-cb`) uses the server's KV cache management to batch multiple independent sequences into one matrix multiplication, which can improve throughput substantially (often 2-4x on an A100, depending on model and workload). The tradeoff is slightly higher VRAM usage per parallel sequence, and you must set the slot count with `-np` or its long form `--parallel`. Crucially, this differs from `--parallel` without `-cb`, which only uses separate slots. Some recent builds appear to enable continuous batching by default, so check the server's `--help` output before assuming the flag is needed.
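To check whether batching actually helps on your hardware, a rough client-side comparison can be sketched as below. It assumes the server listens on `http://127.0.0.1:8080` and exposes the `/completion` endpoint with `prompt` and `n_predict` fields; the helper names `post_completion` and `run_concurrently` are hypothetical, and the timing is wall-clock only, not a rigorous benchmark.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def post_completion(prompt, base_url="http://127.0.0.1:8080", n_predict=64):
    """Send one prompt to the server and return the generated text.

    The base URL is an assumption; change it to match your server.
    """
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read().decode("utf-8"))["content"]


def run_concurrently(prompts, send=post_completion, workers=4):
    """Send all prompts using `workers` threads.

    Returns (results in prompt order, elapsed wall-clock seconds).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(send, prompts))
    return results, time.perf_counter() - start
```

With the server running, call `run_concurrently(prompts, workers=1)` and then `run_concurrently(prompts, workers=4)` on the same prompt list and compare the elapsed times. If the server has batching enabled and enough slots, the second run should finish noticeably sooner; if both take about the same time, requests are likely still being served one at a time.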
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:29:22.362123+00:00— report_created — created