Report #86491
[tooling] llama.cpp server OOMs or has high latency under concurrent requests
Launch `llama-server` with `-np 4` (four parallel slots) and make sure `--cont-batching` is enabled (the default in recent builds). The `--ctx-size` value is the total shared by all slots, so set it to the per-slot context you need multiplied by the slot count. Add `-fa` (flash attention) to reduce the memory used by attention compute buffers.
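As a minimal sketch of the launch line described above, the following Python helper assembles the arguments. The helper name `llama_server_argv`, the model path `model.gguf`, and the 4096-token per-slot target are illustrative assumptions, not part of the report. The bare `-fa` form is also an assumption: some newer builds may expect an explicit value such as `-fa on`, so check `llama-server --help` for your build.

```python
import shlex


def llama_server_argv(model_path, per_slot_ctx, n_slots, flash_attn_args=("-fa",)):
    """Build an argv list for llama-server (hypothetical helper)."""
    argv = [
        "llama-server",
        "-m", model_path,
        "-np", str(n_slots),
        # --ctx-size is the total shared by all slots, so scale the
        # per-slot target by the number of slots.
        "--ctx-size", str(per_slot_ctx * n_slots),
        # Already the default in recent builds; passed explicitly here.
        "--cont-batching",
    ]
    # Assumption: bare "-fa" enables flash attention. Pass
    # flash_attn_args=("-fa", "on") if your build expects a value.
    argv.extend(flash_attn_args)
    return argv


argv = llama_server_argv("model.gguf", per_slot_ctx=4096, n_slots=4)
print(shlex.join(argv))
# llama-server -m model.gguf -np 4 --ctx-size 16384 --cont-batching -fa
```

The list form can be passed to `subprocess.run(argv)` directly, which avoids shell quoting issues with model paths that contain spaces.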
Journey Context:
Without `-np`, the server processes requests sequentially, causing head-of-line blocking and GPU underutilization. With `-np`, the server keeps a separate KV cache region for each slot, so several requests can be in flight at once. Continuous batching lets the GPU compute tokens from multiple sequences in a single forward pass, which can improve throughput by roughly 3-4x. Tradeoff: each slot gets a context window of ctx/np (e.g., 4096 total / 4 slots = 1024 per slot), so long-context requests fail unless `--ctx-size` is raised to match. Flash attention is strongly recommended here, though not because it shrinks the KV cache, which already grows linearly, O(n), with context length. Instead, it avoids materializing the O(n²) attention score matrix, which lowers the compute-buffer memory that can push a multi-slot setup with moderate context into OOM. Common error: setting `-np 8` on a 24GB card with a 70B model, causing an immediate CUDA OOM.
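The slot arithmetic and the linear growth of the KV cache can be checked with a short calculation. This is a rough sketch: the model shape used here (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries) is an illustrative assumption rather than a measurement from the report, and the estimate ignores model weights and compute buffers.

```python
def per_slot_context(total_ctx, n_slots):
    """Context window each slot receives when --ctx-size is split evenly."""
    return total_ctx // n_slots


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size: one K and one V vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Example from the report: 4096 total tokens across 4 slots.
print(per_slot_context(4096, 4))  # 1024

# Hypothetical model shape (assumption, for illustration only).
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

MIB = 1024 ** 2
for total_ctx in (4096, 8192, 16384):
    size_mib = kv_cache_bytes(total_ctx, **shape) // MIB
    print(total_ctx, size_mib, "MiB")
# 4096 512 MiB
# 8192 1024 MiB
# 16384 2048 MiB
```

Doubling the total context doubles the cache, so giving 4 slots a full 4096 tokens each costs about four times the cache of a single 4096-token slot under these assumptions.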
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:45:38.301059+00:00— report_created — created