Report #58036
[tooling] llama.cpp slow inference on CUDA/Metal despite high GPU utilization
Add the `-fa` (or `--flash-attn`) flag to enable Flash Attention kernels, which reduce memory bandwidth pressure and increase throughput by 20-40% on modern GPUs.
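To compare runs with and without the flag, the command line can be assembled programmatically. The sketch below is only an illustration: the binary name `llama-cli`, the model path, the `-ngl 99` offload setting, and the helper name `build_llama_command` are assumptions that do not come from this report. Flag syntax can differ between llama.cpp builds, so check `--help` on your binary before running anything.

```python
import shlex


def build_llama_command(model_path, flash_attn=True, gpu_layers=99, binary="llama-cli"):
    """Return an argument list for a llama.cpp run (hypothetical helper).

    All defaults here are assumptions for illustration, not values from the report.
    """
    args = [binary, "-m", model_path, "-ngl", str(gpu_layers)]
    if flash_attn:
        # The flag named in this report; verify its exact form with --help.
        args.append("-fa")
    return args


# Build both variants so the same prompt can be timed with and without the flag.
with_fa = build_llama_command("models/example.gguf", flash_attn=True)
without_fa = build_llama_command("models/example.gguf", flash_attn=False)

print(shlex.join(with_fa))
print(shlex.join(without_fa))
```

Passing the resulting list to `subprocess.run` avoids shell quoting problems with model paths that contain spaces.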
Journey Context:
Users often assume the slowness comes from model size or quantization level and miss that standard attention implementations are memory-bound. Flash Attention reorders the computation into tiles so that the full attention score matrix is never written out to GPU memory, which cuts memory traffic. VRAM usage during the attention computation typically goes down rather than up as a result, so on CUDA/Metal the speed gain comes at little cost. It is not enabled by default in every build because it depends on backend-specific kernel support.
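The reordering can be illustrated with a toy, pure-Python version of the online-softmax trick for a single query vector. This is a sketch of the general Flash Attention idea, not llama.cpp's actual kernels; the function names and the tile size are hypothetical. The tiled variant keeps only a running maximum, a running denominator, and a running output accumulator, so it never holds the score for every key at once.

```python
import math


def attention_standard(q, keys, values):
    """Reference attention: compute every score first, then softmax, then mix values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_tiled(q, keys, values, tile=2):
    """Same result, but keys/values are consumed tile by tile (online softmax)."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            denom += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / denom for a in acc]


q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 0.5], [0.2, -0.3, 1.5], [-1.0, 2.0, 0.1], [0.0, 0.0, 0.0], [3.0, 1.0, -2.0]]
values = [[1.0, 2.0], [0.0, -1.0], [4.0, 0.5], [-2.0, 3.0], [0.5, 0.5]]

reference = attention_standard(q, keys, values)
tiled = attention_tiled(q, keys, values, tile=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(reference, tiled)))
```

On a GPU the same idea lets each tile stay in fast on-chip memory, which is where the reduction in memory traffic comes from.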
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:54:08.825633+00:00— report_created — created