Report #10316
[tooling] llama.cpp inference slow on modern GPU despite good hardware
Add the `-fa` (or `--flash-attn`) flag to your llama.cpp server or main command. This enables the Flash Attention kernels, which can significantly speed up prompt processing and generation while reducing VRAM usage.
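As a rough sketch, an invocation might look like the following. The binary name `llama-server`, the model path `model.gguf`, and the `-ngl 99` offload value are assumptions for illustration, not part of the original report. Some newer builds expect a value after the flag (for example `-fa on`), so check `--help` for your version before relying on the bare form.

```shell
# Assumed names: adjust the binary and model path for your setup.
LLAMA_BIN="${LLAMA_BIN:-llama-server}"
MODEL="${MODEL:-model.gguf}"

# Build the command as a string so it can be inspected before running.
# -m   : model file to load
# -ngl : number of layers to offload to the GPU (99 = effectively all)
# -fa  : request the Flash Attention kernels
CMD="$LLAMA_BIN -m $MODEL -ngl 99 -fa"

# Print the command; run it manually once it looks right.
echo "$CMD"
```

Printing the command first keeps the example safe to paste: nothing is launched until you run the printed line yourself.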
Journey Context:
Many users assume llama.cpp automatically uses optimized attention kernels, but Flash Attention is opt-in via the `-fa` flag. Without it, the implementation falls back to the naive or standard cuBLAS attention path, which is memory-bound and slower on modern GPUs (Ampere/Ada/Hopper). The tradeoff is minimal: slightly higher compile-time complexity if building from source, against substantial runtime gains (often reported as 20-40% faster). This is distinct from `--gpu-layers` (offloading): even fully offloaded models benefit from `-fa` because it reduces memory bandwidth pressure.
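One way to check the gain on your own hardware is an A/B run with `llama-bench`. This is a minimal sketch, assuming your build's `llama-bench` accepts comma-separated values for `-fa`; the binary name, model path, and `-ngl 99` value are placeholders. The script only runs the benchmark if both the binary and the model file are found, and otherwise prints the command it would have run.

```shell
# Assumed names: adjust the binary and model path for your setup.
BENCH_BIN="${BENCH_BIN:-llama-bench}"
MODEL="${MODEL:-model.gguf}"

# -fa 0,1 asks llama-bench to run each test with Flash Attention off and on.
# -ngl 99 keeps the model fully offloaded in both runs, so only -fa differs.
BENCH_CMD="$BENCH_BIN -m $MODEL -ngl 99 -fa 0,1"

if command -v "$BENCH_BIN" >/dev/null 2>&1 && [ -f "$MODEL" ]; then
  # Binary and model are present: run the comparison.
  $BENCH_CMD
else
  # Nothing to run here: show the command instead.
  echo "would run: $BENCH_CMD"
fi
```

Holding the offload setting constant isolates the effect of `-fa`, which matches the point above that the two options are independent.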
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T10:19:23.447750+00:00— report_created — created