Report #43761
[tooling] llama.cpp server OOM or throughput degradation during continuous batching
Set --parallel (or -np) higher than your actual concurrent request count (e.g., 2x) to create spare KV cache slots for defragmentation.
Journey Context:
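As reported here, llama.cpp's continuous batching needs contiguous room in the KV cache for each sequence. As requests finish and their slots are freed, the cache fragments. The server has a defragmentation pass, but it needs free space to move sequences into. Setting -np higher than the real concurrent request count provides that spare room without increasing the batch size, which is reported to prevent OOM and latency spikes under load. Note that in many llama.cpp builds the total context set with -c is divided across the -np slots, so confirm that each slot still has enough context for your longest request after raising -np.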
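The sizing rule above can be sketched in a few lines of Python. This is only an illustration: the helper names `plan_parallel` and `build_server_args` are hypothetical, the 2x headroom comes from the report's example, and the even split of `-c` across slots is an assumption that should be checked against the llama.cpp build in use. Only the `-m`, `-c`, and `-np` flags of `llama-server` are used.

```python
import math


def plan_parallel(expected_concurrency: int, total_ctx: int, headroom: float = 2.0):
    """Pick a -np value with spare slots and estimate the context left per slot.

    Assumption: the server divides the total context (-c) evenly across slots.
    """
    if expected_concurrency < 1 or total_ctx < 1:
        raise ValueError("expected_concurrency and total_ctx must be positive")
    if headroom < 1.0:
        raise ValueError("headroom must be at least 1.0")
    # Round up so there are never fewer slots than concurrent requests.
    n_parallel = math.ceil(expected_concurrency * headroom)
    # Estimated context available to each slot under the even-split assumption.
    per_slot_ctx = total_ctx // n_parallel
    return n_parallel, per_slot_ctx


def build_server_args(model_path: str, total_ctx: int, n_parallel: int):
    """Build an argument list for llama-server (hypothetical helper)."""
    return ["llama-server", "-m", model_path, "-c", str(total_ctx), "-np", str(n_parallel)]


if __name__ == "__main__":
    n_parallel, per_slot_ctx = plan_parallel(expected_concurrency=4, total_ctx=16384)
    print("slots:", n_parallel, "estimated context per slot:", per_slot_ctx)
    print(" ".join(build_server_args("model.gguf", 16384, n_parallel)))
```

If the estimated per-slot context comes out too small for your longest prompt plus completion, raising -c restores it, at the cost of more KV cache memory. That trade-off matters when the original symptom was OOM.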
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:55:24.995975+00:00— report_created — created