Report #8555
[tooling] llama.cpp inference slower than expected on RTX 30xx/40xx GPUs despite GPU utilization being 100%
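Add the -fa (or --flash-attn) flag to enable llama.cpp's FlashAttention kernels. The reported speedup is 2-3x on Ampere (RTX 30xx) and Ada Lovelace (RTX 40xx) GPUs, because the full attention matrix is never materialized in GPU memory. The exact flag syntax may differ between builds, so check the binary's --help output for the form it accepts.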
Journey Context:
Without FlashAttention, llama.cpp's attention path materializes the full N×N attention score matrix in GPU global memory (VRAM; on these consumer cards that is GDDR6/GDDR6X rather than HBM). Writing and re-reading that matrix makes attention memory-bound, since memory bandwidth rather than compute is the bottleneck on modern GPUs. FlashAttention processes keys and values in tiles with an online softmax, keeping the working data in on-chip SRAM (shared memory) so the kernel moves toward compute-bound throughput. Many users miss this flag because, per this report, it is not enabled by default (it requires specific GPU architecture support). Without it, results are still correct, but the report estimates that 2-3x performance is left on the table compared to vLLM or TGI, which enable FlashAttention by default on supported hardware.
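The tiling idea above can be illustrated with a minimal pure-Python sketch. This is not llama.cpp's kernel code: the function names naive_attention and tiled_attention and the block parameter are hypothetical, chosen only for illustration, and a real kernel also tiles the queries and runs on the GPU. The sketch shows that processing keys and values one block at a time, while carrying a running maximum and a running softmax denominator, gives the same output as first building the whole score row.

```python
import math

def naive_attention(q, k, v):
    """Reference version: builds the full score row for each query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out

def tiled_attention(q, k, v, block=2):
    """Tiled version: only `block` scores exist at a time per query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for qi in q:
        m = float("-inf")   # running maximum of the scores seen so far
        l = 0.0             # running softmax denominator
        acc = [0.0] * dv    # running unnormalized output row
        for start in range(0, len(k), block):
            kb = k[start:start + block]
            vb = v[start:start + block]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in kb]
            m_new = max(m, max(scores))
            corr = math.exp(m - m_new)  # rescales earlier partial sums
            p = [math.exp(s - m_new) for s in scores]
            l = l * corr + sum(p)
            acc = [a * corr + sum(pj * vj[d] for pj, vj in zip(p, vb))
                   for d, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

The naive version holds N scores per query at once (N×N across all queries), while the tiled version holds at most block scores plus a few running values. That reduction in intermediate state is what lets a GPU kernel keep the working set in on-chip memory.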
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T05:46:53.358379+00:00 - report_created - created