Report #84344
[tooling] llama-server crashes with OOM or slows down dramatically after hours of continuous use with varying sequence lengths
Launch llama-server with --defrag-thold 0.5 (or lower, down to 0.1). This triggers KV cache defragmentation once roughly 50% of the cache consists of gaps, compacting the holes left by finished sequences and preventing OOM without a server restart.
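A minimal sketch of assembling that launch command from a script. The helper name build_llama_server_cmd and the model path models/model.gguf are hypothetical placeholders, and only the -m and --defrag-thold flags are used; run llama-server --help on your build to confirm the flag is accepted before relying on it.

```python
import shlex
from typing import List


def build_llama_server_cmd(model_path: str, defrag_thold: float = 0.5) -> List[str]:
    """Build an argv list for llama-server with KV cache defragmentation enabled.

    Hypothetical helper for illustration; the 0.0-1.0 range check is an
    assumption of this sketch, not a rule enforced by llama-server itself.
    """
    if not 0.0 <= defrag_thold <= 1.0:
        raise ValueError("defrag_thold should be a fraction between 0.0 and 1.0")
    return ["llama-server", "-m", model_path, "--defrag-thold", str(defrag_thold)]


# Placeholder model path; substitute the GGUF file you actually serve.
cmd = build_llama_server_cmd("models/model.gguf")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.Popen(cmd) or pasted into a service unit after checking it against your own deployment.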
Journey Context:
llama-server uses continuous batching, in which sequences of different lengths start and finish dynamically. When a sequence ends, it leaves a 'hole' in the contiguous KV cache memory pool. Without defragmentation these holes accumulate until the allocator cannot find a large enough contiguous region and treats memory as exhausted, even though total free capacity is sufficient. The default --defrag-thold is often -1.0 (disabled), or 0.1 in older versions; the former never reclaims the holes, and the latter can be too aggressive (high CPU). Setting it to 0.5 balances the CPU cost of copying KV blocks against memory reclamation, which matters for production servers that handle dynamic chat traffic for days at a time.
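The fragmentation effect described above can be reproduced with a toy model. This is a simplified sketch, not llama.cpp's actual implementation: the ToyKVCache class, its fragmentation formula (share of free cells below the highest occupied cell), and the 16-cell size are all assumptions chosen for illustration.

```python
from typing import List, Optional


class ToyKVCache:
    """Toy cell array: each cell holds a sequence id, or None when free."""

    def __init__(self, n_cells: int, defrag_thold: float = -1.0) -> None:
        self.cells: List[Optional[int]] = [None] * n_cells
        self.defrag_thold = defrag_thold  # a negative value disables compaction

    def fragmentation(self) -> float:
        """Share of free cells (holes) below the highest occupied cell."""
        occupied = [i for i, c in enumerate(self.cells) if c is not None]
        if not occupied:
            return 0.0
        return 1.0 - len(occupied) / (occupied[-1] + 1)

    def defrag(self) -> None:
        """Compact live cells to the front, preserving their order."""
        live = [c for c in self.cells if c is not None]
        self.cells = live + [None] * (len(self.cells) - len(live))

    def _find_run(self, n: int) -> Optional[int]:
        """Return the start index of the first run of n free cells, if any."""
        run = 0
        for i, c in enumerate(self.cells):
            run = run + 1 if c is None else 0
            if run == n:
                return i - n + 1
        return None

    def alloc(self, seq_id: int, n: int) -> bool:
        """Reserve n contiguous cells, compacting first if over the threshold."""
        if 0 <= self.defrag_thold < self.fragmentation():
            self.defrag()
        start = self._find_run(n)
        if start is None:
            return False
        self.cells[start:start + n] = [seq_id] * n
        return True

    def free(self, seq_id: int) -> None:
        """Release every cell owned by seq_id, leaving holes in place."""
        self.cells = [None if c == seq_id else c for c in self.cells]


def demo(defrag_thold: float) -> bool:
    cache = ToyKVCache(16, defrag_thold)
    for seq_id, n in enumerate([5, 3, 5, 3]):  # fill all 16 cells
        assert cache.alloc(seq_id, n)
    cache.free(0)  # leaves a 5-cell hole
    cache.free(2)  # leaves another 5-cell hole
    return cache.alloc(4, 8)  # needs 8 contiguous cells; 10 are free in total


print(demo(-1.0), demo(0.5))  # prints: False True
```

With compaction disabled, the 8-cell request fails even though 10 cells are free, because no single hole is large enough. With a threshold of 0.5, the measured fragmentation of 0.625 triggers compaction first and the same request succeeds.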
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:09:45.526346+00:00— report_created — created