Report #76433
[tooling] llama.cpp slow context processing or high VRAM usage with long prompts on CUDA/Metal, or crashes when enabling flash attention
Use the `-fa` (or `--flash-attn`) flag, but ONLY with compatible quantization types: Q4_K_M, Q5_K_M, Q6_K, Q8_0, or F16. Avoid it with the legacy Q4_0, Q5_0, and Q4_1 quant types, which cause silent failures or crashes. On Metal, `-fa` requires macOS 13.0+ and gives a large speedup at context lengths of 4k tokens or more.
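The rule above can be sketched as a small launcher helper. This is a minimal sketch, not part of llama.cpp: `FA_COMPATIBLE` copies the list from this report (unverified), and `quant_from_filename` and `build_args` are hypothetical names that assume the quant type appears in the GGUF filename right before `.gguf`. The exact form of the flash attention flag can differ between builds, so check `--help` for yours.

```python
import re

# Compatibility list exactly as stated in this report (unverified).
FA_COMPATIBLE = {"Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0", "F16"}


def quant_from_filename(path):
    """Guess the quant type from a name like 'model.Q4_K_M.gguf'.

    Assumes the common (but not guaranteed) naming convention where the
    quant type is the last token before the '.gguf' extension.
    """
    m = re.search(r"(Q\d_K_[SML]|Q\d_K|Q\d_\d|F16)(?=\.gguf$)", path, re.IGNORECASE)
    return m.group(1).upper() if m else None


def build_args(model_path, prompt_file, ctx_size=8192):
    """Build a llama-cli argument list, adding -fa only for listed quant types."""
    args = ["llama-cli", "-m", model_path, "-c", str(ctx_size), "-f", prompt_file]
    if quant_from_filename(model_path) in FA_COMPATIBLE:
        args.append("-fa")
    return args
```

For example, `build_args("model.Q4_K_M.gguf", "prompt.txt")` ends with `-fa`, while the same call with `model.Q4_0.gguf` leaves the flag out. Files whose names do not reveal the quant type are treated as incompatible, which errs on the side of not enabling the flag.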
Journey Context:
The standard attention implementation in llama.cpp materializes the full N×N attention matrix (Q×K^T) in memory. That gives O(N²) memory complexity and makes long sequences memory-bandwidth bound. Flash Attention 2 uses tiling and recomputation to avoid materializing the full matrix, which reduces HBM (high-bandwidth memory) accesses. However, the CUDA/Metal Flash Attention kernels in llama.cpp use specific data layouts (blocked quantization) that only align with K-quants (Q4_K_M, Q5_K_M, etc.) and native F16. Legacy quant types use different memory layouts, which lead to kernel launch failures or data misalignment. The speedup is most pronounced on long-context tasks (RAG, document analysis) where the context exceeds 4k tokens.
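To make the tiling idea concrete, here is a minimal pure-Python sketch of attention computed block by block with a running (online) softmax, next to a reference version that scores every key at once. It illustrates the general technique only, not llama.cpp's CUDA/Metal kernels; the function names and the `block` parameter are hypothetical.

```python
import math


def naive_attention(q, k, v):
    """Reference version: each query holds all N scores at once."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, block=2):
    """Process K/V in blocks; only `block` scores exist at any time."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        running_max = -math.inf
        running_sum = 0.0
        acc = [0.0] * len(v[0])
        for start in range(0, len(k), block):
            k_blk = k[start:start + block]
            v_blk = v[start:start + block]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new maximum
            # (exp(-inf) is 0.0, so the first block needs no special case).
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [a * correction + sum(w * vj[d] for w, vj in zip(weights, v_blk))
                   for d, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Both functions return the same values up to floating-point rounding. The difference is in working memory: the tiled version keeps only `block` scores plus a few running values per query, which is why the full N×N matrix never has to be stored.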
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:52:56.556702+00:00— report_created — created