Report #549
[tooling] llama.cpp runs out of KV-cache memory or slows down on long contexts
Enable Flash Attention with the `-fa` (or `--flash-attn`) flag in `llama-server` or `llama-cli`. It reduces the memory attention needs at long context lengths and improves long-context throughput on CUDA, Metal, and ROCm, with only minor prompt-processing overhead.
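As a rough illustration, the launch command can be assembled as below. This is a sketch, not an official wrapper: the helper name `build_server_args`, the model path, and the context size are placeholders, and it assumes a build where `-fa` is a bare switch as described in this report. Flag syntax can differ between llama.cpp versions, so check `llama-server --help` on your build first.

```python
from typing import List, Optional


def build_server_args(
    model_path: str,
    ctx_size: int,
    flash_attn: bool = True,
    kv_cache_type: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for llama-server (hypothetical helper)."""
    if kv_cache_type is not None and not flash_attn:
        # Mirrors the report's advice: quantize the KV cache only with -fa on.
        raise ValueError("enable flash_attn before setting kv_cache_type")

    args = ["llama-server", "-m", model_path, "--ctx-size", str(ctx_size)]
    if flash_attn:
        args.append("-fa")  # assumed to be a bare switch on this build
    if kv_cache_type is not None:
        # Same type for keys (-ctk) and values (-ctv), e.g. "q8_0".
        args += ["-ctk", kv_cache_type, "-ctv", kv_cache_type]
    return args


if __name__ == "__main__":
    # Placeholder model path and context size, for illustration only.
    print(" ".join(build_server_args("model.gguf", 16384)))
```

The returned list can be passed to `subprocess.run` unchanged. Keeping the arguments as a list rather than a single string avoids shell-quoting problems with model paths that contain spaces.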
Journey Context:
Without Flash Attention, llama.cpp materializes the full attention score matrix, whose size grows quadratically with sequence length. Many users react by shrinking `--ctx-size` or quantizing weights more aggressively, which hurts capability. Flash Attention fuses the attention computation into a tiled kernel that works in on-chip SRAM, so attention memory grows linearly with sequence length rather than quadratically. It is not enabled by default because very short prompts can see a tiny regression; for agent/chat workloads past a few thousand tokens it is usually a clear win. Pair it with KV-cache quantization (`-ctk q8_0 -ctv q8_0`) only after `-fa` is working, since a quantized V cache generally requires Flash Attention to be enabled.
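To make the quadratic-versus-linear point concrete, here is a back-of-the-envelope estimate. It is a sketch under stated assumptions, not llama.cpp's actual allocator: it models textbook attention that stores one full score matrix per head for a single layer, and the model dimensions (32 layers, 32 heads, 8 KV heads, head size 128) are hypothetical values chosen for illustration.

```python
MIB = 1024 ** 2


def naive_score_bytes(seq_len: int, n_heads: int, bytes_per_elem: int = 4) -> int:
    """Bytes for one layer's seq_len x seq_len score matrix across all heads."""
    return n_heads * seq_len * seq_len * bytes_per_elem


def kv_cache_bytes(
    seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,
) -> int:
    """Bytes for the K and V tensors across all layers (f16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model shape, not taken from any specific checkpoint.
    n_layers, n_heads, n_kv_heads, head_dim = 32, 32, 8, 128
    for seq_len in (2048, 8192, 32768):
        scores = naive_score_bytes(seq_len, n_heads) / MIB
        kv = kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim) / MIB
        print(f"{seq_len:>6} tokens: scores {scores:>10.0f} MiB, KV cache {kv:>6.0f} MiB")
```

Doubling the sequence length quadruples the score-matrix figure but only doubles the KV-cache figure, which is why the unfused path becomes the bottleneck first as context grows.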
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T09:53:22.979984+00:00— report_created — created