Report #57499
[tooling] llama.cpp slow token generation on long contexts (>4k) despite GPU utilization appearing high
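Compile llama.cpp with `LLAMA_FLASH_ATTN=ON` and run inference with the `--flash-attn` flag. Build option names have changed across llama.cpp versions, so confirm the option against the build documentation for your checkout. Flash Attention cuts HBM (high-bandwidth memory) traffic by never writing the full N×N attention score matrix, so the extra memory attention needs grows linearly rather than quadratically with sequence length. The reported effect is a 10-30% speedup on contexts of 4k tokens and longer, and roughly 2x longer contexts before memory bandwidth becomes the limit; expect these figures to vary with model, quantization, and GPU.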
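To make the quadratic-versus-linear point concrete, here is a rough back-of-the-envelope sketch. It counts only the intermediate score matrix that a non-fused implementation would write out for one layer, and compares it with the per-row running state a tiled kernel keeps instead. The head count (32), head dimension (128), and element sizes are illustrative assumptions, not values taken from this report or from any particular model.

```python
def score_matrix_bytes(n_tokens, n_heads=32, bytes_per_elem=2):
    """Bytes for one layer's full n_tokens x n_tokens score matrix, all heads.

    n_heads and bytes_per_elem (fp16) are assumed, illustrative values.
    """
    return n_heads * n_tokens * n_tokens * bytes_per_elem


def tiled_state_bytes(n_tokens, n_heads=32, head_dim=128, bytes_per_elem=4):
    """Bytes for the per-query-row state a tiled kernel carries instead:
    an output accumulator (head_dim values) plus a running max and running sum.

    head_dim and bytes_per_elem (fp32) are assumed, illustrative values.
    """
    return n_heads * n_tokens * (head_dim + 2) * bytes_per_elem


if __name__ == "__main__":
    for n in (1024, 4096, 8192, 16384):
        full = score_matrix_bytes(n) / 2**20
        tiled = tiled_state_bytes(n) / 2**20
        print(f"{n:>6} tokens: full scores {full:9.1f} MiB, tiled state {tiled:6.1f} MiB")
```

Doubling the context quadruples the first number but only doubles the second, which is why the gap widens quickly past a few thousand tokens.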
Journey Context:
Standard attention implementations in llama.cpp (including the CUDA kernels) materialize the full attention score matrix in global memory (N×N for an N-token prompt), so on long contexts the model becomes memory-bandwidth bound rather than compute bound. Users often misattribute the resulting slowdown to model size or quantization type. Flash Attention (Dao et al.) uses tiling and an online softmax to compute attention in SRAM-sized blocks without ever materializing the full matrix. In llama.cpp this requires both compile-time support (`LLAMA_FLASH_ATTN`) and the runtime flag (`--flash-attn`). The tradeoffs are slightly higher register pressure and a minimum CUDA version (11.8+ according to this report); in exchange, the speedup matters most for long-context chat and RAG pipelines. This is llama.cpp's own implementation, distinct from FlashInfer and other external kernels.
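The tiling and online-softmax idea can be sketched in a few lines of plain Python. This is a minimal illustration for a single query vector, not llama.cpp's kernel: the function names, the block size, and the random test data are all hypothetical. It visits keys and values one block at a time, keeps only a running maximum, a running sum, and an output accumulator, and rescales them whenever a new block raises the maximum.

```python
import math
import random


def naive_attention(q, keys, values):
    """Reference: compute every score first, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def tiled_attention(q, keys, values, block=4):
    """Same result, but only one block of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalized output, relative to running_max
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first block this is exp(-inf) == 0.0, which is harmless.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    random.seed(0)
    q = [random.uniform(-1, 1) for _ in range(8)]
    keys = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(10)]
    values = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(10)]
    ref = naive_attention(q, keys, values)
    out = tiled_attention(q, keys, values, block=4)
    print("max abs difference:", max(abs(a - b) for a, b in zip(ref, out)))
```

The two functions should agree to floating-point rounding. A real kernel does the same bookkeeping per query row with blocks sized to fit on-chip memory, which is where the bandwidth saving comes from.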
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:00:00.433053+00:00— report_created — created