Report #5819
[tooling] llama.cpp with CUDA/Metal: prompt processing is slower than expected, with low GPU utilization despite sufficient VRAM
Compile with `-DLLAMA_CUDA_ENABLE_FLASH_ATTENTION=ON` (CUDA), or make sure the Metal backend is built with Flash Attention support (macOS 13.3+), then enable it at runtime with `--flash-attn`. Flash Attention fuses the attention operations, which reduces HBM bandwidth pressure during softmax. The reported result is a 20-40% speedup on prompts longer than 4K tokens, plus room for larger batch sizes on memory-bound GPUs. CMake option names have changed across llama.cpp releases, so confirm that this flag exists in your checkout before relying on it.
Journey Context:
Users often download pre-built llama.cpp binaries or compile with the default CMake flags, not realizing that Flash Attention requires explicit opt-in at compile time for CUDA (and a specific OS version for Metal). They then see high VRAM allocation but low GPU compute utilization (idle SMs) and incorrectly assume the workload is compute-bound. Flash Attention avoids materializing the full N×N attention matrix in high-bandwidth memory, which cuts memory traffic, and memory bandwidth is the actual bottleneck in this case. Both flags are needed: without the compile flag the optimized kernels are not built, and without the runtime flag they are not invoked even if present.
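To make the bandwidth argument concrete, here is a minimal Python sketch. It is not llama.cpp code. It first estimates how large the full N×N score matrix gets, assuming 32 heads and 2-byte (fp16) scores, both illustrative values rather than figures for any particular model. It then shows the running-max/running-sum softmax trick that lets attention be computed without ever storing all N scores. All function names are hypothetical, and real GPU kernels work on tiles rather than one key at a time.

```python
import math


def score_matrix_bytes(n_tokens, n_heads, bytes_per_elem=2):
    """Bytes needed to hold the full N x N score matrix for every head of one layer."""
    return n_tokens * n_tokens * n_heads * bytes_per_elem


def naive_attention_row(q, keys, values):
    """Attention output for one query, materializing all N scores first."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def streaming_attention_row(q, keys, values):
    """Same output, keeping only a running max, a running sum, and an accumulator."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = sum(a * b for a, b in zip(q, k))
        new_max = max(running_max, s)
        scale = math.exp(running_max - new_max)  # rescale earlier partial results
        w = math.exp(s - new_max)
        running_sum = running_sum * scale + w
        acc = [a * scale + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Assumed example: 4096-token prompt, 32 heads, fp16 scores.
print(score_matrix_bytes(4096, 32) / 2**30)  # 1.0 (GiB, for a single layer)

q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(naive_attention_row(q, keys, values))
print(streaming_attention_row(q, keys, values))
```

Under these assumptions, a 4K-token prompt already implies about 1 GiB of score traffic per layer if the matrix is written out and read back. The streaming version produces the same result while holding only a few scalars and one output-sized accumulator per query.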
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T22:15:13.934074+00:00— report_created — created