Report #12214
[tooling] Gradual performance degradation (increasing latency) in long-running llama.cpp server sessions with continuous batching (`-cb`) and parallel requests (`-np`)
Set `--defrag-thold 0.1` (or lower) so that KV cache defragmentation is triggered once about 10% of the cache is fragmented. This keeps the 'holes' left by finished requests from accumulating and slowing down attention computation.
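As a minimal sketch of the launch configuration above, the snippet below assembles a `llama-server` command line with the three flags named in this report. The helper name `build_server_command`, the model path, and the slot count are hypothetical placeholders. Flag availability and defaults vary between llama.cpp builds, so check `llama-server --help` on your build before relying on it.

```python
import shlex


def build_server_command(model_path, parallel_slots=4, defrag_thold=0.1):
    """Assemble (but do not run) a llama-server command line.

    model_path and parallel_slots are illustrative placeholders;
    only the flag names come from the report.
    """
    if parallel_slots < 1:
        raise ValueError("parallel_slots must be at least 1")
    return [
        "llama-server",
        "-m", model_path,                      # hypothetical model file
        "-cb",                                 # continuous batching
        "-np", str(parallel_slots),            # number of parallel slots
        "--defrag-thold", str(defrag_thold),   # KV cache defrag threshold
    ]


cmd = build_server_command("models/example.gguf")
print(shlex.join(cmd))
```

Building the argument list separately from launching it makes it easy to log the exact command and to compare latency across different threshold values.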
Journey Context:
When `llama-server` runs with continuous batching (`-cb`) and parallel processing (`-np`), requests of varying lengths start and finish at different times. Each finished request leaves 'holes' (empty slots) in the contiguous KV cache. Over time, this fragmentation forces the attention mechanism into non-contiguous memory accesses or extra copying, which degrades throughput. The `--defrag-thold` flag controls when the server 'packs' the cache to remove the holes. Many users don't know the flag exists and wonder why their server slows down after hours of operation. Setting the threshold too low causes excessive defragmentation CPU overhead; setting it too high lets fragmentation accumulate. A value of 0.1 (10%) is a good balance for most workloads.
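The threshold behaviour described above can be illustrated with a toy model. This is a sketch under stated assumptions, not llama.cpp's actual implementation: the cache is modelled as a list of booleans, fragmentation is assumed to be the share of free cells below the last occupied cell, and all function names are hypothetical.

```python
def fragmentation(cells):
    """Fraction of holes within the span up to the last occupied cell.

    cells is a list of booleans: True = occupied, False = free.
    """
    last_used = max((i for i, used in enumerate(cells) if used), default=-1)
    span = last_used + 1
    if span == 0:
        return 0.0
    holes = span - sum(cells[:span])
    return holes / span


def defragment(cells):
    """Pack all occupied cells to the front; the cache size is unchanged."""
    used = sum(cells)
    return [True] * used + [False] * (len(cells) - used)


def maybe_defragment(cells, threshold=0.1):
    """Pack the cache only when fragmentation exceeds the threshold."""
    if fragmentation(cells) > threshold:
        return defragment(cells)
    return cells


# Ten cells: a request occupying cells 3-5 has finished, leaving a hole
# between two requests that are still active.
cache = [True] * 3 + [False] * 3 + [True] * 2 + [False] * 2
print(fragmentation(cache))        # 3 holes in a span of 8 cells
packed = maybe_defragment(cache)   # exceeds 0.1, so the cache is packed
print(fragmentation(packed))
```

In this model a lower threshold makes packing fire more often (more overhead), while a higher one tolerates more holes before packing, which mirrors the trade-off described in the report.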
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T15:20:04.004668+00:00— report_created — created