Report #7319
[tooling] Slow prompt processing \(prefill\) speeds on Apple Silicon or AVX2 CPUs despite high memory bandwidth
Enable Flash Attention with the -fa or --flash-attn flag in llama.cpp. It is NOT CUDA-only: llama.cpp ships optimized CPU implementations \(including Apple Silicon NEON and x86 AVX2\) that fuse the attention operations. Fusing reduces memory bandwidth pressure during the attention computation and significantly speeds up prompt processing \(often 2x faster prefill\) on CPU-only inference.
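One way to check whether the flag helps on a given machine is to time prefill with it off and on. The sketch below only builds and prints the two command lines; it does not run them. It assumes the llama-bench tool from a llama.cpp build is on the PATH and that it accepts -m, -p, -n, -ngl and -fa with a 0/1 value. The helper name build_bench_cmd and the model.gguf path are hypothetical. Flag syntax can differ between builds, so confirm it with --help before running anything.

```python
import shlex


def build_bench_cmd(model_path, flash_attn, n_prompt=512, binary="llama-bench"):
    """Return an argv list for one prefill-only benchmark run.

    The flag names are assumptions about llama-bench; verify them with --help.
    """
    return [
        binary,
        "-m", model_path,
        "-p", str(n_prompt),                 # number of prompt tokens to process (prefill)
        "-n", "0",                           # skip token generation, measure prefill only
        "-ngl", "0",                         # keep all layers on the CPU
        "-fa", "1" if flash_attn else "0",   # Flash Attention off or on
    ]


if __name__ == "__main__":
    # Print both variants so they can be reviewed and run by hand.
    for fa in (False, True):
        print(shlex.join(build_bench_cmd("model.gguf", fa)))
```

Comparing the prompt-processing tokens/sec reported by the two runs shows what the flag does on that hardware, model, and context length.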
Journey Context:
Flash Attention is commonly associated with CUDA and GPU training frameworks \(PyTorch/Transformers\), so users assume llama.cpp's --flash-attn flag only works with cuBLAS/CUDA builds. However, llama.cpp includes its own CPU implementation of Flash Attention \(using the 'online softmax' algorithm\) that is optimized for ARM NEON \(Apple Silicon\) and AVX2/AVX-512 \(x86\). Standard attention writes the O\(N²\) attention matrix to RAM and reads it back, so it becomes memory-bandwidth bound on CPUs. Flash Attention fuses the attention computation into kernels that keep intermediate data in cache and registers, which sharply reduces DRAM traffic for the attention step. The effect is largest in prompt processing \(prefill\), where the entire context is processed at once: this report puts it at the difference between 100 tokens/sec and 500 tokens/sec on an M2 Max, though the gain depends on the model, the context length, and the build. The flag is -fa or --flash-attn and works in CPU builds \(check how your build was compiled and confirm the flag in its --help output\). The tradeoff is slightly higher binary size and compilation time for the optimized kernels. The model still has to fit in memory: Flash Attention helps with bandwidth, not with memory capacity.
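The 'online softmax' idea mentioned above can be illustrated with a minimal pure-Python sketch for a single query vector. This is not llama.cpp's kernel, which is vectorized C/C++. It only shows how processing keys and values block by block, with a running maximum, a running normalizer, and a running output accumulator, gives the same result as materializing every score first. The function names and the block size are hypothetical.

```python
import math


def naive_attention(q, keys, values):
    """Reference version: compute every score, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def online_softmax_attention(q, keys, values, block=2):
    """Blockwise version: never holds more than `block` scores at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    q = [1.0, 0.5]
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    print(naive_attention(q, keys, values))
    print(online_softmax_attention(q, keys, values))
```

Because the blockwise loop needs only the current block of scores, a real kernel can keep that working set in cache or registers instead of writing the full score matrix to RAM.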
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T02:20:24.508503+00:00— report_created — created