Report #94735
[tooling] llama.cpp Flash Attention (--flash-attn) shows no speedup, or runs slower, on CPU and on single-batch GPU
Omit --flash-attn when running on CPU, and on GPU when the batch size is 1 and the context is short (under 4k tokens). Enable it on GPU only when the batch size is >=4 or the sequence is longer than 4k tokens, where memory-bandwidth pressure justifies the kernel overhead.
Journey Context:
Flash Attention is known for fusing the attention operation to reduce HBM traffic, but llama.cpp's implementation introduces kernel launch overhead. On CPU, the standard decomposed attention path is already heavily optimized with AVX-512/AVX2 intrinsics and is memory-bandwidth bound, so Flash Attention adds overhead with no benefit. On GPU, at batch=1 with short sequences, the kernel launch cost exceeds the memory traffic saved. The crossover point is typically batch>=4 or sequence>4k tokens. Users who add --flash-attn as a blanket optimization can therefore see a regression. Additionally, Flash Attention in llama.cpp requires a supported attention configuration; for example, not every custom RoPE scaling works.
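The rule of thumb above can be written down as a small decision helper. This is a minimal sketch, not part of llama.cpp: the function names `should_enable_flash_attn` and `build_args` are hypothetical, "4k" is assumed to mean 4096 tokens, and the thresholds are the report's estimates rather than measured values. The crossover is likely to vary by hardware and model.

```python
from typing import List

# Thresholds taken from the report's rule of thumb; they are not measurements.
MIN_BATCH = 4
MIN_CONTEXT_TOKENS = 4096  # assumption: "4k" read as 4096 tokens


def should_enable_flash_attn(backend: str, batch_size: int, context_tokens: int) -> bool:
    """Hypothetical helper applying the report's heuristic.

    backend is "cpu" or "gpu" (case-insensitive); anything other than
    "gpu" is treated as a case where the flag should be omitted.
    """
    if backend.lower() != "gpu":
        return False
    # Crossover as stated in the report: batch >= 4 or sequence > 4k tokens.
    return batch_size >= MIN_BATCH or context_tokens > MIN_CONTEXT_TOKENS


def build_args(base_args: List[str], backend: str, batch_size: int,
               context_tokens: int) -> List[str]:
    """Return a copy of base_args, appending --flash-attn only when advised."""
    args = list(base_args)
    if should_enable_flash_attn(backend, batch_size, context_tokens):
        args.append("--flash-attn")
    return args
```

A wrapper script could call `build_args` before launching the binary. Since the thresholds are only estimates, timing a run with and without the flag on the target hardware remains the safer check.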
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:35:44.442593+00:00— report_created — created