Report #41586
[tooling] llama.cpp slow inference on Apple Silicon or AVX2 CPU despite being memory-bandwidth bound
Add the `-fa` (Flash Attention) flag to the CLI arguments. It enables fused attention kernels that reduce memory-bandwidth pressure, with a reported 10-30% speedup on Apple Silicon and modern x86 CPUs.
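A rough way to check whether the flag helps on a given machine is to time the same prompt with and without it. The sketch below assumes the binary is named `llama-cli` and is on `PATH`, and that `model.gguf` is a placeholder for a real model file; both are illustrative, not taken from this report. The flag is passed bare, as written above; flag syntax has varied between llama.cpp versions, so confirm it with `--help` on your build. Wall-clock time includes model loading, so treat the numbers as indicative only.

```python
import subprocess
import time


def build_cmd(model_path, prompt, flash_attn, n_predict=128, binary="llama-cli"):
    """Return the argument list for one run, with or without -fa."""
    cmd = [binary, "-m", model_path, "-p", prompt, "-n", str(n_predict)]
    if flash_attn:
        cmd.append("-fa")  # flag as given in this report; confirm with --help
    return cmd


def time_run(cmd):
    """Run the command to completion and return wall-clock seconds."""
    start = time.perf_counter()
    # stdin is closed so the process cannot block waiting for interactive input.
    subprocess.run(cmd, check=True, capture_output=True, stdin=subprocess.DEVNULL)
    return time.perf_counter() - start


if __name__ == "__main__":
    model = "model.gguf"  # hypothetical path; replace with a real model file
    for fa in (False, True):
        secs = time_run(build_cmd(model, "Hello", fa))
        print(f"flash_attn={fa}: {secs:.1f}s")
```

Running each configuration several times and using a longer prompt gives a steadier comparison, since the benefit described here depends on context length.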
Journey Context:
Users often assume Flash Attention is CUDA-only or GPU-only. llama.cpp also implements it on the CPU, using SIMD (AVX2/NEON) kernels that avoid materializing the full attention matrix. On bandwidth-bound CPUs (which, for LLM token generation, is effectively all of them), the fused kernels cut memory traffic significantly. Common mistake: thinking `-fa` requires a GPU. It works on CPU and is often most beneficial there, because CPU RAM bandwidth is scarcer than GPU HBM bandwidth. Without `-fa`, attention layers become the bottleneck at context lengths above roughly 1k tokens, even on an M2 Ultra.
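To illustrate what "avoid materializing the full attention matrix" means, here is a minimal pure-Python sketch of the general online-softmax idea for a single query. It is not llama.cpp's kernel; the function names and the block size are made up for illustration. The streaming version keeps only a running maximum, a running sum, and an output accumulator, so it never stores the scores for all keys at once.

```python
import math


def naive_attention(q, keys, values):
    """Reference: compute every score first, then softmax, then weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def streaming_attention(q, keys, values, block=2):
    """Fused-style: walk keys/values in blocks, rescaling a running result."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier contributions to the new maximum (0.0 on first block).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    q = [0.5, -1.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.3, 0.3]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    print(naive_attention(q, keys, values))
    print(streaming_attention(q, keys, values))
```

Both functions produce the same result up to floating-point rounding; the difference is that the streaming version touches each key and value once and holds only one block of scores at a time, which is the property that reduces memory traffic.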
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T00:16:22.599552+00:00— report_created — created