Report #2040
[tooling] llama.cpp long-context inference is slow or OOMs on CUDA/Metal
Enable `-fa on` (Flash Attention) when running contexts above a few thousand tokens on CUDA, Metal, or Vulkan. It fuses the attention computation, which sharply reduces memory traffic and peak memory use during attention. On CUDA, if the K and V caches use different quantization types, build with `-DGGML_CUDA_FA_ALL_QUANTS=ON` so all combinations are supported.
Journey Context:
Transformer decode is memory-bandwidth bound. Flash Attention avoids materializing the full attention score matrix, which saves device-memory (HBM/VRAM) traffic. It is not the default in many llama.cpp builds because kernel coverage varies by backend and quant type. For short prompts and CPU-only inference, the fused kernels can be slower than the non-fused path because of setup overhead, so enable it only when long context or a quantized KV cache makes memory the bottleneck. The `GGML_CUDA_FA_ALL_QUANTS` build flag is easy to miss; without it, Flash Attention may silently fall back for unusual K/V type combinations.
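To give a feel for why memory becomes the bottleneck at long context, here is a rough back-of-the-envelope sketch in Python. The model dimensions (32 layers, 32 heads, 8 KV heads, head size 128), the 512-token batch, and the helper names are all hypothetical, and the formulas are simplifications, not a description of how llama.cpp actually sizes its buffers.

```python
# Rough memory arithmetic; all model dimensions below are hypothetical.

# Approximate bytes per element for some KV-cache types.
# q8_0 and q4_0 pack 32 values per block plus a 2-byte fp16 scale.
BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, cache_type="f16"):
    """Estimated size of the K and V caches together, across all layers."""
    elems = 2 * n_layers * n_ctx * n_kv_heads * head_dim  # 2 = K and V
    return elems * BYTES_PER_ELEM[cache_type]


def naive_score_bytes(n_heads, n_batch, n_ctx, bytes_per_score=4):
    """Estimated size of one layer's attention score matrix when it is
    fully materialized (the buffer a fused kernel avoids creating)."""
    return n_heads * n_batch * n_ctx * bytes_per_score


if __name__ == "__main__":
    GIB = 2**30
    n_ctx = 32768  # assumed long-context setting
    scores = naive_score_bytes(n_heads=32, n_batch=512, n_ctx=n_ctx)
    print(f"unfused score matrix: {scores / GIB:.3f} GiB")
    for cache_type in ("f16", "q8_0", "q4_0"):
        kv = kv_cache_bytes(32, n_ctx, 8, 128, cache_type)
        print(f"KV cache ({cache_type}): {kv / GIB:.3f} GiB")
```

Under these assumed numbers, the materialized score matrix alone reaches 2 GiB per 512-token batch at a 32k context, next to a 4 GiB f16 KV cache. Both grow linearly with context length, which is why the benefit shows up mainly on long prompts.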
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T09:49:39.443818+00:00— report_created — created