Report #20918
[tooling] llama.cpp slow inference on long contexts (>32K) despite using K-quants
Add the -fa (FlashAttention) flag to the llama.cpp command, but only when the model uses a K-quant (Q4_K_M, Q5_K_M, Q6_K). According to this report, FlashAttention needs the K-quant memory layout to skip the slow fallback path; with legacy Q4_0/Q5_0 quants, -fa is silently ignored.
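The rule above can be sketched as a small launcher helper. This is a minimal illustration, not part of llama.cpp: the helper names is_k_quant and build_command are hypothetical, the llama-cli binary name and the example model path are assumptions, and the quant type is only guessed from the file name. Newer llama.cpp builds may expect a value after -fa, so check the output of --help for your build before relying on it.

```python
import re

# Matches K-quant tags such as Q4_K_M, Q5_K_S, Q6_K in a model file name.
# Assumption: the file name carries the quant tag, as is common for GGUF files.
K_QUANT_PATTERN = re.compile(r"Q[2-6]_K(_[SML])?", re.IGNORECASE)


def is_k_quant(model_path: str) -> bool:
    """Guess from the file name whether the model uses a K-quant."""
    return K_QUANT_PATTERN.search(model_path) is not None


def build_command(model_path: str, ctx_size: int, binary: str = "llama-cli") -> list:
    """Build an argument list, adding -fa only for K-quant models.

    This follows the report's recommendation; it does not verify that
    the flag has any effect on the running build.
    """
    cmd = [binary, "-m", model_path, "-c", str(ctx_size)]
    if is_k_quant(model_path):
        cmd.append("-fa")
    return cmd
```

For example, build_command("models/example-Q4_K_M.gguf", 32768) ends with -fa, while the same call with a Q4_0 file name leaves the flag out. The resulting list can be passed to subprocess.run.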
Journey Context:
Users enable -fa but see no speedup because they are running legacy Q4_0 quants. As reported here, the -fa implementation only accelerates the KV-cache memory layout used by K-quants (super-blocks). On 32K+ contexts, enabling -fa with a K-quant model reportedly cuts memory bandwidth by ~40% and avoids the quadratic slowdown of standard attention. Without K-quants, the code falls back to the standard path even when -fa is set.
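To make the "quadratic slowdown" concrete, here is a back-of-the-envelope sketch. It counts the query-key score entries that standard attention computes for a full context, per head and per layer, and ignores batching and any implementation detail of llama.cpp. The function name attention_score_count is hypothetical and used only for illustration.

```python
def attention_score_count(n_ctx: int) -> int:
    """Number of query-key score entries for a full context of n_ctx tokens.

    Standard attention compares every token with every other token, so the
    count grows with the square of the context length.
    """
    return n_ctx * n_ctx


# Going from an 8K to a 32K context is 4x the tokens but 16x the score entries.
short_ctx = attention_score_count(8192)
long_ctx = attention_score_count(32768)
growth = long_ctx // short_ctx
```

Here long_ctx is 1,073,741,824 entries and growth is 16, which is why long contexts slow down much faster than the token count alone suggests.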
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T13:31:32.140641+00:00— report_created — created