Report #16885
[tooling] llama.cpp server crashes with 'out of memory' after hours of continuous batching with varying sequence lengths despite `nvidia-smi` showing free memory
Enable KV cache defragmentation by passing `--defrag-thold N` (short form `-dt`, or the environment variable `LLAMA_ARG_DEFRAG_THOLD`) when starting `llama-server`; the KV cache is then compacted whenever fragmentation exceeds the threshold `N`, which eliminates the gaps. Confirm the option with `llama-server --help`, since option names and defaults vary between builds.
Journey Context:
In continuous batching, a sequence that finishes leaves a 'hole' in the KV cache buffer. Over time these holes fragment the cache, so a new long sequence cannot find contiguous space even though the total free space is sufficient. A common reaction is to increase `--ctx-size` or reduce the batch size, neither of which addresses the fragmentation itself. Defragmentation runs a compaction routine that moves active cache entries together to close the gaps, preventing this OOM in long-running production servers.
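The fragmentation effect described above can be reproduced with a toy model. This is a minimal sketch, not llama.cpp's implementation: the cache is assumed to be a flat list of cells, and the helper names (`allocate`, `release`, `compact`, `largest_gap`) are hypothetical, chosen only for illustration.

```python
def largest_gap(cells):
    """Return the length of the longest run of free (None) cells."""
    best = run = 0
    for c in cells:
        run = run + 1 if c is None else 0
        best = max(best, run)
    return best


def allocate(cells, seq_id, n):
    """Place seq_id in the first contiguous run of n free cells.

    Returns True on success, False if no such run exists.
    """
    run = 0
    for i, c in enumerate(cells):
        run = run + 1 if c is None else 0
        if run == n:
            start = i - n + 1
            cells[start:i + 1] = [seq_id] * n
            return True
    return False


def release(cells, seq_id):
    """Free every cell owned by seq_id, leaving a hole in place."""
    for i, c in enumerate(cells):
        if c == seq_id:
            cells[i] = None


def compact(cells):
    """Move occupied cells to the front (order preserved) so free space is one run."""
    used = [c for c in cells if c is not None]
    return used + [None] * (len(cells) - len(used))


# Toy cache of 16 cells, filled by eight short sequences of 2 cells each.
cells = [None] * 16
for seq_id in range(8):
    allocate(cells, seq_id, 2)

# Every other sequence finishes, leaving four separate 2-cell holes.
for seq_id in (0, 2, 4, 6):
    release(cells, seq_id)

free_total = cells.count(None)          # total free cells
gap_before = largest_gap(cells)         # largest contiguous free run
ok_before = allocate(cells, "long", 6)  # a 6-cell sequence needs one run of 6

cells = compact(cells)                  # defragment
gap_after = largest_gap(cells)
ok_after = allocate(cells, "long", 6)

print(free_total, gap_before, ok_before, gap_after, ok_after)
```

In this model half of the cache is free before compaction, yet the 6-cell request fails because no single hole is larger than 2 cells; after compaction the same request succeeds without any change to the cache size.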
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T03:52:46.867094+00:00— report_created — created