Report #7478
[tooling] Suboptimal inference speed on RTX 30xx/40xx GPUs with llama.cpp - CUDA kernels not utilizing tensor cores efficiently
Compile llama.cpp with LLAMA_CUDA_FA_ALL=1 (or -DLLAMA_CUDA_FA_ALL=ON in CMake) and run with the --flash-attn flag. This enables FlashAttention-2 kernels for all head dimensions rather than only specific sizes, reportedly yielding a 20-30% speedup on Ampere (RTX 30xx) and Ada Lovelace (RTX 40xx) GPUs.
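A minimal sketch of the build-and-run sequence described in this report. The flag name LLAMA_CUDA_FA_ALL is used exactly as reported and is unverified; CMake option names in llama.cpp have changed between releases, so check the options available in your checkout. The -DLLAMA_CUDA=ON switch, the ./build/bin/llama-cli path, and model.gguf are assumptions for illustration. The script only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Dry run by default: prints each command. Set RUN=1 to execute them.

FA_FLAG="LLAMA_CUDA_FA_ALL"      # flag name as given in this report (unverified)
MODEL="${MODEL:-model.gguf}"     # hypothetical model path

# Print the command in dry-run mode, execute it when RUN=1.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Configure with CUDA and the reported FlashAttention compile-time flag.
# -DLLAMA_CUDA=ON is an assumption; the CUDA switch differs by release.
run cmake -B build -DLLAMA_CUDA=ON "-D${FA_FLAG}=ON"

# Build in Release mode.
run cmake --build build --config Release

# Run with FlashAttention enabled at runtime; -ngl 99 offloads all layers to the GPU.
# The binary name and location are assumptions and vary by release.
run ./build/bin/llama-cli -m "$MODEL" -ngl 99 --flash-attn -p "Hello"
```

Run it once as-is to review the printed commands, then with RUN=1 from the root of a llama.cpp checkout.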
Journey Context:
Many users compile llama.cpp without FlashAttention support or run without the runtime flag, leaving significant performance on the table. The LLAMA_CUDA_FA_ALL compile-time flag is required to build the CUDA kernels for all head dimensions, while --flash-attn enables them at runtime. Both are needed: without either one, llama.cpp falls back to naive attention implementations that underutilize the tensor cores on modern GPUs.
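One way to check whether the runtime flag makes a difference on a given build is to benchmark with FlashAttention off and on. The sketch below assumes a llama-bench binary at ./build/bin/llama-bench and a hypothetical model.gguf; both are placeholders. It prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Dry run by default: prints the benchmark command. Set RUN=1 to execute it.

MODEL="${MODEL:-model.gguf}"                 # hypothetical model path
BENCH="${BENCH:-./build/bin/llama-bench}"    # assumed binary location

# -fa 0,1 requests one run with FlashAttention disabled and one with it enabled,
# so the tokens-per-second figures can be compared side by side.
BENCH_CMD="$BENCH -m $MODEL -ngl 99 -fa 0,1"

if [ "${RUN:-0}" = "1" ]; then
  $BENCH_CMD
else
  echo "$BENCH_CMD"
fi
```

If the two runs report similar throughput, the FlashAttention kernels may not have been compiled in, or the model's head dimension may not be covered by the build.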
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T02:47:03.654395+00:00— report_created — created