Report #65565
[tooling] llama-server with -np 4 and -cb has terrible throughput, slower than sequential processing
Ensure `-nkvo` (no KV offload) is NOT set when using `-cb` together with `-np`. Continuous batching requires unified KV cache management across sequences, and `-nkvo` disables the cross-sequence cache optimization.
Journey Context:
Power users enable `-cb` (continuous batching) and `-np` (parallel sequences) expecting roughly 4x throughput for concurrent requests with `-np 4`. Instead they see latency spikes and cache thrashing, and overall throughput ends up lower than processing the requests sequentially. The root cause is the interaction between continuous batching and `-nkvo`, a common optimization for single-sequence generation that keeps the KV cache in system RAM to save VRAM. When `-nkvo` is active, each sequence allocates independent KV cache blocks that cannot be batched together efficiently in the CUDA kernels, whereas `-cb` assumes a unified KV cache in VRAM that can be split dynamically across sequences. Removing `-nkvo` lets the server use the optimized cross-sequence cache paths, but it requires enough VRAM to hold the KV cache for all parallel sequences simultaneously.
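As a rough illustration of the two checks implied above, the sketch below scans a launch command for the flag combination this report describes and estimates how much VRAM the KV cache would need once `-nkvo` is removed. It is not part of llama.cpp: the helper names `check_server_args` and `kv_cache_bytes`, the file name `model.gguf`, and the model dimensions in the usage note are all hypothetical. The size estimate assumes an unquantized f16 KV cache (2 bytes per element) and ignores any allocator overhead.

```python
import shlex

# Short flags as named in this report, plus their long spellings.
NO_KV_OFFLOAD = {"-nkvo", "--no-kv-offload"}
CONT_BATCHING = {"-cb", "--cont-batching"}
PARALLEL = {"-np", "--parallel"}


def check_server_args(command: str) -> list[str]:
    """Return warnings if a launch command combines -nkvo with -cb / -np > 1."""
    args = shlex.split(command)
    has_nkvo = any(a in NO_KV_OFFLOAD for a in args)
    has_cb = any(a in CONT_BATCHING for a in args)

    # Read the value following -np / --parallel, defaulting to one sequence.
    n_parallel = 1
    for i, a in enumerate(args):
        if a in PARALLEL and i + 1 < len(args) and args[i + 1].isdigit():
            n_parallel = int(args[i + 1])

    warnings = []
    if has_nkvo and (has_cb or n_parallel > 1):
        warnings.append(
            "-nkvo is set together with continuous batching or parallel "
            "sequences; this report suggests removing -nkvo"
        )
    return warnings


def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size: K and V tensors for every layer and position.

    n_ctx is the TOTAL number of cached positions across all sequences.
    bytes_per_elem=2 assumes an f16 cache (an assumption, not a measured value).
    """
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem
```

For example, `check_server_args("llama-server -m model.gguf -np 4 -cb -nkvo")` returns one warning, while the same command without `-nkvo` returns an empty list. For a hypothetical model with 32 layers, 8 KV heads, and a head dimension of 128, four sequences of 4096 tokens each give `kv_cache_bytes(32, 4 * 4096, 8, 128)`, which works out to 2 GiB that must fit in VRAM alongside the model weights.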
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:32:12.127384+00:00— report_created — created