Report #5988
[tooling] llama.cpp server slows down drastically after many turns in multi-user chat or when using parallel slots (batching), despite having sufficient VRAM
Enable `--flash-attn` alongside `--defrag-thold 0.1` (or a similar threshold) in llama-server. Flash Attention selects memory-efficient attention kernels, and the threshold enables automatic KV cache defragmentation, which prevents the fragmentation behind the batch-processing slowdowns.
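As a minimal sketch of the suggested launch configuration, the helper below assembles a llama-server command line with both flags. The function name `build_llama_server_cmd`, the model path, and the slot and context sizes are hypothetical placeholders, not values from this report. Flag spellings and defaults have changed between llama.cpp releases, so check `llama-server --help` on your build before relying on them.

```python
import shlex


def build_llama_server_cmd(model_path, parallel_slots=4, ctx_size=16384,
                           flash_attn=True, defrag_thold=0.1):
    """Assemble an argv list for llama-server (illustrative helper, not part of llama.cpp)."""
    cmd = [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),        # total context shared across all slots
        "--parallel", str(parallel_slots),  # number of parallel slots
    ]
    if flash_attn:
        # Assumption: the build accepts --flash-attn as a bare switch;
        # some releases expect a value after it instead.
        cmd.append("--flash-attn")
    if defrag_thold is not None:
        cmd += ["--defrag-thold", str(defrag_thold)]
    return cmd


# Placeholder model path; substitute your own GGUF file.
cmd = build_llama_server_cmd("models/example.gguf")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd)` to start the server. Printing it first makes it easy to confirm the flags before launching.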
Journey Context:
The KV cache is shared by every parallel slot (conversation). As conversations of varying lengths start, grow, and end, the cells they free leave gaps scattered through the cache, like holes in Swiss cheese. New sequences cannot always reuse those gaps efficiently, so the server either has to compact the cache (expensive) or fails to find room for a new batch, even though total VRAM is sufficient. Flash Attention makes the attention kernels IO-aware, but it does not remove fragmentation by itself. That is handled by llama.cpp's defragmentation pass, which runs when `--defrag-thold` is set and compacts the KV cache once fragmentation exceeds the threshold. Many users enable Flash Attention for speed but miss the defragmentation flag, and then see unexplained slowdowns after 10+ turns in chat UIs. For production multi-user local endpoints, use both flags together.
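The fragmentation effect can be illustrated with a toy model. This is a sketch under stated assumptions, not llama.cpp's implementation: the cache is modelled as a flat list of cells, the `fragmentation` metric (share of holes below the highest used cell) is invented for illustration, and all function names are hypothetical.

```python
def fragmentation(cells):
    """Toy metric: fraction of empty cells below the highest used cell."""
    used = [i for i, c in enumerate(cells) if c is not None]
    if not used:
        return 0.0
    span = used[-1] + 1
    return (span - len(used)) / span


def defragment(cells):
    """Move live cells to the front, preserving their order."""
    live = [c for c in cells if c is not None]
    return live + [None] * (len(cells) - len(live))


def maybe_defragment(cells, thold):
    """Compact only when the toy metric exceeds the threshold; a negative threshold disables it."""
    if thold >= 0 and fragmentation(cells) > thold:
        return defragment(cells), True
    return cells, False


# Three conversations (A, B, C) interleaved in a 12-cell pool.
cells = ["A", "B", "C", "A", "B", "C", "A", "B", "C", None, None, None]

# Conversation B ends and its cells are freed, leaving holes between A and C.
cells = [None if c == "B" else c for c in cells]
frag_before = fragmentation(cells)

cells, ran = maybe_defragment(cells, thold=0.1)
frag_after = fragmentation(cells)

print(f"before={frag_before:.2f} ran={ran} after={frag_after:.2f}")
```

In this model, freeing one of three interleaved conversations leaves a third of the occupied span empty, which is well above a 0.1 threshold and triggers compaction. A lower threshold compacts more eagerly at the cost of moving cache data more often.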
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T22:46:37.742409+00:00— report_created — created