Report #96895
[tooling] llama.cpp crashes or stalls with long contexts (32k+) due to KV cache fragmentation
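Enable KV cache defragmentation with --defrag-thold 0.1 (short form -dt; suggested range 0.01-0.1). Whether defragmentation is on by default depends on the llama.cpp build: older releases ship with it disabled, so check the default printed by --help. llama-server accepts the same --defrag-thold flag.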
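A minimal sketch of passing the threshold when launching llama.cpp from a script. It assumes a llama-cli binary on PATH and a placeholder model path. build_llama_cmd is an illustrative helper, not part of llama.cpp. Flag availability varies between releases, so confirm against --help before running.

```python
import shlex


def build_llama_cmd(model_path, ctx_size=32768, defrag_thold=0.1, binary="llama-cli"):
    """Build an argv list that launches llama.cpp with KV cache defrag enabled.

    Hypothetical helper for illustration; the binary name and model path
    are assumptions, not values taken from the report.
    """
    # The report suggests 0.01-0.1; here we only reject values that would
    # disable defragmentation or make no sense as a fraction.
    if not 0.0 < defrag_thold < 1.0:
        raise ValueError("defrag_thold must be a fraction between 0 and 1")
    return [
        binary,
        "-m", model_path,                     # model file
        "-c", str(ctx_size),                  # context size in tokens
        "--defrag-thold", str(defrag_thold),  # KV cache defrag threshold
    ]


if __name__ == "__main__":
    # Placeholder path; replace with a real GGUF file.
    print(shlex.join(build_llama_cmd("models/model.gguf")))
```

Building the argument list in one place makes it easy to swap the threshold or context size per model without editing shell scripts.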
Journey Context:
Without defragmentation, the KV cache fills with holes over a long conversation. Eventually a new batch cannot find a large enough contiguous free region, which shows up as sudden allocation failures or massive slowdowns during generation. Most users don't know this flag exists because short contexts work fine. The value is a fragmentation threshold: at 0.1 the cache is compacted whenever fragmentation exceeds about 10%, and lower values trigger compaction more often. Tradeoff: slight overhead during each defrag pass vs. stability. This is essential for 70B models at 32k+ context on consumer hardware.
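The threshold behaviour can be illustrated with a toy model. This is a simplified sketch, not llama.cpp's implementation: it assumes the cache is a flat list of cells and that fragmentation is the share of free cells below the highest occupied one. All function names here are hypothetical.

```python
def fragmentation(cells):
    """Fraction of holes below the highest occupied cell (True = occupied)."""
    used = sum(cells)
    if used == 0:
        return 0.0
    # Span covers everything up to and including the last occupied cell.
    span = max(i for i, occupied in enumerate(cells) if occupied) + 1
    return 1.0 - used / span


def should_defrag(cells, thold):
    """Trigger compaction only when fragmentation exceeds a positive threshold."""
    return thold > 0 and fragmentation(cells) > thold


def defrag(cells):
    """Compact occupied cells to the front, leaving one contiguous free region."""
    used = sum(cells)
    return [True] * used + [False] * (len(cells) - used)


if __name__ == "__main__":
    cache = [True, False, True, True, False, True, False, False]
    print(round(fragmentation(cache), 2))  # 4 used cells over a span of 6
    if should_defrag(cache, thold=0.1):
        cache = defrag(cache)
    print(fragmentation(cache))
```

In this model a lower threshold makes should_defrag fire on smaller amounts of fragmentation, which is why smaller values behave more aggressively.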
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:13:20.713178+00:00— report_created — created