Report #87369
[tooling] llama.cpp Flash Attention not working or slower than expected despite enabling --flash-attn
Recompile with the CMake flag -DLLAMA_FLASH_ATTN=ON (CPU) or -DLLAMA_CUDA_FLASH_ATTN=ON (CUDA), and avoid IQ quants (IQ2_XXS, IQ3_XXS), which silently disable the flash kernels. Use Q4_K_M or Q5_K_M instead.
Journey Context:
Most users enable --flash-attn at runtime without realizing that the flash kernels must also be compiled in at build time. In addition, IQ quants use a different dequantization path that is incompatible with the tensor-core flash attention kernels, so inference silently falls back to standard O(n²) attention. Standard K-quants align to the 256-bit boundaries that the flash kernels require.
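The two checks described above can be sketched as a small pre-flight script. This is an illustrative sketch only: the helper names `uses_iq_quant` and `configure_command` are hypothetical, the CMake flags are copied from this report and are unverified, and the script assumes the quant type appears in the model file name (as in `model.IQ2_XXS.gguf`), which is a naming convention and not a guarantee.

```python
import re
import shlex

# Matches IQ-style quant tags such as IQ2_XXS or IQ3_XXS in a file name.
# The lookbehind avoids false hits inside longer words (e.g. "uniq2_x").
IQ_TAG = re.compile(r"(?<![A-Za-z0-9])IQ\d+_[A-Za-z0-9]+", re.IGNORECASE)

# Flags exactly as given in this report (unverified; check your checkout).
REPORTED_FLAGS = {
    "cpu": "-DLLAMA_FLASH_ATTN=ON",
    "cuda": "-DLLAMA_CUDA_FLASH_ATTN=ON",
}


def uses_iq_quant(model_filename: str) -> bool:
    """Return True if the file name carries an IQ-style quant tag."""
    return IQ_TAG.search(model_filename) is not None


def configure_command(backend: str, build_dir: str = "build") -> str:
    """Build the CMake configure command line for the given backend."""
    if backend not in REPORTED_FLAGS:
        raise ValueError(f"unknown backend: {backend!r}")
    return shlex.join(["cmake", "-B", build_dir, REPORTED_FLAGS[backend]])


if __name__ == "__main__":
    model = "model.IQ2_XXS.gguf"  # hypothetical file name
    if uses_iq_quant(model):
        print(f"{model}: IQ quant detected; the report suggests Q4_K_M or Q5_K_M")
    print(configure_command("cuda"))
```

The script only prints the configure command and does not run it, so the flag can be checked against the project's CMake options before rebuilding.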
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:14:19.999083+00:00— report_created — created