Report #16565
[tooling] llama.cpp Flash Attention (-fa) crashes or causes silent slowdown with quantized KV cache (Q4_0/Q8_0)
Disable Flash Attention (`-fa 0`) when using `-ctv q4_0` or `-ctv q8_0`. Either run Flash Attention with an F16/F32 KV cache (`-ctv f16`), or run a quantized cache without Flash Attention. Do not combine `-fa` with quantized cache types in standard builds: the Flash Attention kernels expect an F16/BF16 KV layout and will either fail outright or fall back to a slow dequantization path.
Journey Context:
Users often add `-fa` for speed, then add `-ctv q4_0` to fit a larger context, expecting the two benefits to stack. However, the Flash Attention implementations (CUDA/Metal) operate on packed F16 tensors. When the KV cache is quantized to Q4_0, the engine must either reject the combination (older builds) or dequantize blocks on the fly during the attention pass, which forfeits the memory-bandwidth savings. The tradeoff: F16+FA minimizes compute time (good for compute-bound batches), while Q4_0 without FA minimizes memory footprint (good for long-context, single-user workloads). On high-end consumer hardware (RTX 4090, M2 Ultra), F16+FA is usually 20-30% faster than Q4_0 without FA, but Q4_0 allows roughly 2x the context length in the same memory. Choose based on the bottleneck: if you run out of VRAM, use Q4_0 and disable FA; if you have VRAM headroom, use F16 and enable FA.
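To make the memory side of this tradeoff concrete, here is a minimal Python sketch that estimates KV cache size per cache type and then applies the rule of thumb above. The block sizes are the standard GGML ones (F16 is 2 bytes per value, Q8_0 packs 32 values into 34 bytes, Q4_0 packs 32 values into 18 bytes). The model dimensions in the usage note (32 layers, 8 KV heads, head dimension 128) and the helper names `kv_cache_bytes` and `pick_config` are illustrative assumptions, not part of llama.cpp, and the estimate ignores weights, compute buffers, and allocator overhead.

```python
# Bytes per cached value for each cache type (GGML block layouts).
BYTES_PER_ELEM = {
    "f16": 2.0,        # 2 bytes per value
    "q8_0": 34 / 32,   # 32 values + one F16 scale = 34 bytes per block
    "q4_0": 18 / 32,   # 32 4-bit values + one F16 scale = 18 bytes per block
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim,
                   type_k="f16", type_v="f16"):
    """Estimate the KV cache size in bytes for a single sequence."""
    # Values stored per token for K alone (V stores the same count).
    elems_per_token = n_layers * n_kv_heads * head_dim
    bytes_per_token = elems_per_token * (
        BYTES_PER_ELEM[type_k] + BYTES_PER_ELEM[type_v]
    )
    return n_ctx * bytes_per_token


def pick_config(kv_budget_bytes, **dims):
    """Hypothetical helper encoding the report's rule of thumb.

    If an F16 cache fits in the budget, keep F16 and enable Flash
    Attention; otherwise quantize V to Q4_0 and disable Flash Attention.
    It does not check whether the Q4_0 cache itself fits.
    """
    if kv_cache_bytes(type_k="f16", type_v="f16", **dims) <= kv_budget_bytes:
        return {"flash_attn": True, "type_v": "f16"}
    return {"flash_attn": False, "type_v": "q4_0"}
```

With the assumed dimensions and `n_ctx=8192`, the F16 cache comes to exactly 1 GiB. Quantizing only V (`type_v="q4_0"`, which is what `-ctv q4_0` alone does) shrinks it by a factor of about 1.56, and quantizing K as well (`-ctk q4_0`) by about 3.56, so the "2x" figure falls between the two cases. Treat these as capacity estimates only; they say nothing about the speed difference, which has to be measured on the target hardware.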
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T02:56:12.895666+00:00— report_created — created