Report #40708
[tooling] llama.cpp server dropping concurrent requests or serializing parallel prompts
Launch `llama-server` with `-cb` (continuous batching), `-np 4` (parallel sequences), and `-fa` (flash attention), and set the shared context size to `-c 8192` or higher.
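As a minimal sketch of the launch line above, the following Python helper assembles the argument list so it can be printed or handed to `subprocess`. The helper name `build_llama_server_cmd` and the model path `models/example.gguf` are hypothetical placeholders, and `-m` is assumed to be how your build takes the model path. Flag forms have changed between llama.cpp releases, so compare the output with `llama-server --help` before running it.

```python
import shlex


def build_llama_server_cmd(model_path, ctx_size=8192, n_parallel=4):
    """Return an argv list for llama-server using the flags named in this report.

    model_path is a placeholder supplied by the caller; nothing here checks
    that the file exists or that the installed build accepts these flags.
    """
    return [
        "llama-server",
        "-m", model_path,          # model file (hypothetical path in the example)
        "-cb",                     # continuous batching
        "-np", str(n_parallel),    # number of parallel sequences (slots)
        "-fa",                     # flash attention; some builds expect a value here
        "-c", str(ctx_size),       # total context shared across all slots
    ]


if __name__ == "__main__":
    cmd = build_llama_server_cmd("models/example.gguf")
    # Print a shell-safe version of the command for review before running it.
    print(shlex.join(cmd))
```

Keeping the flags in a list rather than a single string avoids quoting mistakes when the model path contains spaces.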
Journey Context:
Without continuous batching, the server processes one request to completion before starting the next, which causes head-of-line blocking. The `-cb` flag enables continuous batching: new requests join the running batch while other sequences are still decoding, so prefill (prompt processing) and decode (token generation) are interleaved. On its own, `-cb` does not add concurrency, because `-np` (parallel sequences) determines how many slots, each with its own share of the KV cache, are available for concurrent sequences. A common mistake is to enable `-cb` but omit `-np`, which leaves the server with a single slot, so additional connections wait their turn (and may look dropped if the client times out) instead of running in parallel. Flash Attention (`-fa`) is recommended alongside this because it lowers the memory that attention needs at long and mixed sequence lengths, which reduces the risk of out-of-memory errors when batching. The `-c` parameter sets the total context window shared across all sequences, not a per-sequence limit, so `-c 8192` with `-np 4` leaves roughly 2048 tokens per request for prompt and generated output combined.
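The context arithmetic above can be sketched as a small budget check. This assumes the total context is split evenly across slots, which is how the report describes it; builds that share the KV cache differently may behave otherwise. The function names are illustrative and not part of llama.cpp.

```python
def per_slot_context(ctx_size, n_parallel):
    """Tokens available to each slot, assuming -c is divided evenly across -np slots."""
    if n_parallel < 1:
        raise ValueError("n_parallel must be at least 1")
    return ctx_size // n_parallel


def request_fits(prompt_tokens, max_new_tokens, ctx_size, n_parallel):
    """True if prompt plus generated tokens stay within one slot's share of the context."""
    return prompt_tokens + max_new_tokens <= per_slot_context(ctx_size, n_parallel)


if __name__ == "__main__":
    budget = per_slot_context(8192, 4)
    print(f"per-slot budget: {budget} tokens")
    # A 1500-token prompt with 512 new tokens fits; a 1800-token prompt does not.
    print(request_fits(1500, 512, 8192, 4))
    print(request_fits(1800, 512, 8192, 4))
```

If typical requests exceed the per-slot budget, either raise `-c` or lower `-np`; both change the same quotient.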
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:48:03.778108+00:00— report_created — created