Report #15377
[tooling] llama.cpp CPU inference slow on long prompts despite fast processor
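Add the `-fa` or `--flash-attn` flag to enable Flash Attention on CPU. Flash Attention avoids materializing the full attention matrix during prompt processing (prefill), which cuts the attention working memory from O(n²) to O(n) in the prompt length n. The arithmetic is still O(n²), but prefill is no longer dominated by memory traffic; the reported speedup is 2-5x on AVX/NEON CPUs.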
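The O(n²) versus O(n) difference can be made concrete by counting how many bytes of attention scores are held at once for a single head. This is a rough sketch only: the function names, the 4-byte score size, and the tile size of 256 are illustrative assumptions, not values taken from llama.cpp, and the real buffers depend on batch size and build options.

```python
def full_scores_bytes(n_tokens, bytes_per_score=4):
    # Naive attention: one head's full n x n score matrix is stored.
    return n_tokens * n_tokens * bytes_per_score


def tiled_scores_bytes(n_tokens, tile=256, bytes_per_score=4):
    # Tiled attention: one tile of scores, plus a running max and a
    # running sum per query row (the O(n) part).
    t = min(tile, n_tokens)
    return t * t * bytes_per_score + 2 * n_tokens * bytes_per_score


if __name__ == "__main__":
    for n in (1024, 8192, 32768):
        print(n, full_scores_bytes(n), tiled_scores_bytes(n))
```

Under these assumptions the full matrix grows with the square of the prompt length, while the tiled variant grows only linearly once the prompt is longer than one tile.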
Journey Context:
Users often assume Flash Attention is GPU-only because the original paper focused on CUDA kernels. llama.cpp implements a CPU version using SIMD (AVX2/ARM NEON) that sharply reduces memory bandwidth pressure during prompt ingestion. Without this flag, CPU inference on long contexts becomes memory-bandwidth bound and crawls; with it, the attention computation works through blocks small enough to stay in the CPU cache hierarchy. This is distinct from quantization or continuous batching: it specifically targets the attention bottleneck during prefill.
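The core idea of not materializing all scores can be sketched in pure Python for a single query vector. This is a toy illustration of the streaming (online) softmax technique that Flash Attention builds on, not llama.cpp's SIMD kernel; the function names and the block size are hypothetical, and the usual 1/sqrt(d) score scaling is omitted for brevity.

```python
import math


def naive_attention(q, keys, values):
    # Computes and stores every score before normalizing.
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def streaming_attention(q, keys, values, block=2):
    # Visits keys/values one block at a time, keeping only a running
    # max, a running sum, and an output accumulator.
    dim = len(values[0])
    run_max = float("-inf")
    run_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        kb = keys[start:start + block]
        vb = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in kb]
        new_max = max(run_max, max(scores))
        # Rescale earlier partial results to the new maximum;
        # on the first block this factor is exp(-inf) = 0.0.
        scale = math.exp(run_max - new_max)
        run_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, vb):
            w = math.exp(s - new_max)
            run_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        run_max = new_max
    return [a / run_sum for a in acc]
```

Both functions return the same result up to floating-point rounding, but the streaming version never holds more than one block of scores, which is what keeps the working set cache-sized.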
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T23:52:59.097256+00:00— report_created — created