Report #35598
[tooling] llama.cpp slow inference on long contexts despite GPU acceleration
Add the `--flash-attn` flag to enable Flash Attention, and set `-ngl` high enough that all layers, and with them the KV cache, stay on the GPU. If your build lacks Flash Attention support, rebuild llama.cpp with `GGML_CUDA_FLASH_ATTN=ON`.
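As a rough illustration of the fix above, the following sketch only assembles a command line and prints it; it does not run llama.cpp. The binary name `./llama-cli`, the model path `models/example.gguf`, the context size, and the layer count of 99 are placeholder assumptions, and the helper `build_llama_command` is hypothetical. Some newer builds may expect an explicit value after `--flash-attn`, so check `--help` for your version first.

```python
import shlex


def build_llama_command(model_path, n_gpu_layers=99, ctx_size=16384,
                        binary="./llama-cli"):
    """Assemble an argument list for llama.cpp (hypothetical helper).

    All defaults are illustrative assumptions, not recommended values.
    """
    return [
        binary,
        "-m", model_path,             # model file (placeholder path)
        "-c", str(ctx_size),          # context length in tokens
        "-ngl", str(n_gpu_layers),    # layers to offload to the GPU
        "--flash-attn",               # flag named in this report
    ]


cmd = build_llama_command("models/example.gguf")
print(shlex.join(cmd))
```

Printing the command before running it makes it easy to confirm that both flags are present, which fits the unverified status of this workaround.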
Journey Context:
Standard attention is memory-bandwidth bound on long sequences because it repeatedly moves the KV cache and the intermediate attention scores between on-chip memory and off-chip GPU memory (HBM on datacenter cards, GDDR VRAM on consumer cards). Flash Attention tiles the computation so that each block stays in on-chip SRAM, which substantially reduces off-chip memory accesses. Many users run with it disabled or use a build compiled without support. Tradeoff: slightly more compute in exchange for much better memory-bandwidth utilization. This matters most on consumer GPUs, where memory bandwidth, not compute, is the bottleneck.
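To make the tiling argument concrete, here is a back-of-the-envelope sketch comparing the size of a fully materialized attention score matrix with a single tile during prompt processing. The head count (32), the 2-byte element size, and the 128x128 tile are illustrative assumptions, not llama.cpp's actual kernel parameters, and real implementations do not necessarily materialize all heads at once.

```python
GIB = 1024 ** 3
KIB = 1024


def score_matrix_bytes(n_tokens, n_heads, bytes_per_elem=2):
    """Bytes for the full n_tokens x n_tokens score matrix, all heads, one layer."""
    return n_tokens * n_tokens * n_heads * bytes_per_elem


def tile_bytes(block_rows, block_cols, bytes_per_elem=2):
    """Bytes for one score tile that a tiled kernel keeps on chip."""
    return block_rows * block_cols * bytes_per_elem


for n in (4096, 32768):
    full = score_matrix_bytes(n, n_heads=32)  # 32 heads is an assumption
    print(f"{n} tokens: full scores = {full / GIB:.1f} GiB per layer")

print(f"one 128x128 tile = {tile_bytes(128, 128) / KIB:.0f} KiB")
```

Under these assumptions the full score matrix grows from 1.0 GiB at 4096 tokens to 64.0 GiB at 32768 tokens, while a tile stays at 32 KiB regardless of context length. That quadratic growth is why long contexts suffer most without tiling.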
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:13:56.154794+00:00— report_created — created