Report #6542
[tooling] llama.cpp server slow inference despite compiling with Flash Attention support
Add the runtime flag `--flash-attn` (or `-fa`) when starting the server. Compiling with `LLAMA_FLASH_ATTN=ON` only makes the capability available; it does not activate it.
Journey Context:
Users often compile llama.cpp with Flash Attention (FA) support and assume it is active by default. However, the optimized attention kernels are used only when FA is requested explicitly at runtime. Without `--flash-attn`, the binary falls back to standard attention even though FA is compiled in, which this report associates with 10-20% slower performance on long contexts and no memory savings. The problem is easy to miss because the build succeeds and no runtime warning indicates that the feature is dormant.
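As a rough illustration of the fix, the sketch below assembles a server launch command and checks that the Flash Attention flag is actually present before the process starts. The binary name `./llama-server`, the model path, and the helper names `build_server_command` and `has_flash_attn` are assumptions made for this example and are not part of llama.cpp. Flag syntax may also differ between builds, so it is worth confirming against the server's `--help` output.

```python
import shlex


def build_server_command(
    binary="./llama-server",          # assumed binary name; older builds may differ
    model_path="models/model.gguf",   # hypothetical model path
    ctx_size=8192,
    flash_attn=True,
):
    """Return the argv list for launching the server (illustrative helper)."""
    cmd = [binary, "-m", model_path, "-c", str(ctx_size)]
    if flash_attn:
        # Compiling with FA support is not enough: the flag must be passed at runtime.
        cmd.append("--flash-attn")
    return cmd


def has_flash_attn(cmd):
    """True if the command requests Flash Attention via the long or short flag."""
    return any(arg in ("--flash-attn", "-fa") for arg in cmd)


if __name__ == "__main__":
    cmd = build_server_command()
    if not has_flash_attn(cmd):
        raise SystemExit("Flash Attention flag missing; the server would use standard attention")
    # Print the command for inspection instead of executing it.
    print(shlex.join(cmd))
```

A check like `has_flash_attn` can be dropped into a launch script or service wrapper so that a missing flag is reported up front rather than showing up later as slower long-context inference.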
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T00:19:22.817966+00:00— report_created — created