Report #51442
[tooling] Slow CPU inference with llama.cpp despite high core count; prefill phase is bottlenecked
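Enable Flash Attention on CPU with the `-fa` CLI flag (long form `--flash-attn`). The kernel computes attention in cache-sized tiles instead of materializing the full score matrix, which reduces memory-bandwidth pressure during prompt processing and often yields 1.5-2x prefill speedups on CPU. The result is mathematically equivalent to standard attention up to floating-point rounding, so the perplexity impact is minimal. Depending on the llama.cpp build, the flag may expect an explicit value (for example `-fa on`); check `--help` for your version.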
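One way to check whether the flag helps on a given machine is to benchmark prefill with Flash Attention off and on. The sketch below only builds and prints a `llama-bench` command line; it does not run anything. The model path, token count, and thread count are placeholders, and the flag spellings (`-p`, `-n`, `-t`, `-ngl`, and the comma-separated `-fa 0,1`) are assumed from recent `llama-bench` builds, so confirm them against `llama-bench --help` before running.

```python
import shlex


def build_prefill_bench_cmd(model_path, prompt_tokens=2048, threads=8,
                            binary="llama-bench"):
    """Return an argv list that benchmarks prefill with Flash Attention off and on."""
    return [
        binary,
        "-m", model_path,          # GGUF model file (placeholder path)
        "-p", str(prompt_tokens),  # prompt-processing (prefill) test size
        "-n", "0",                 # assumed to skip the token-generation test
        "-t", str(threads),        # CPU threads
        "-ngl", "0",               # keep all layers on the CPU
        "-fa", "0,1",              # one run without, one run with Flash Attention
    ]


cmd = build_prefill_bench_cmd("models/model.gguf", prompt_tokens=4096, threads=16)
print(shlex.join(cmd))
```

Comparing the two prompt-processing rows in the benchmark output shows whether the reported speedup holds for your model, context length, and CPU.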
Journey Context:
Flash Attention (`--flash-attn`) is often assumed to be a CUDA-only option for saving VRAM, but llama.cpp also implements a CPU kernel, enabled with `-fa`. On CPU, inference is often memory-bandwidth-bound rather than compute-bound. Flash Attention avoids materializing the N×N attention score matrix, so intermediate attention memory drops from O(N²) to O(N) and much less data moves through DRAM, which matters most for long-context RAG and summarization. Tradeoff: memory usage during generation may be slightly higher, so consider disabling the flag for very long generations on RAM-constrained machines. Many CPU tutorials omit `-fa` because the documentation emphasizes GPU flags.
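To give a sense of scale for the quadratic term, the following sketch compares the size of a fully materialized score matrix with the size of a single tile of scores. It is back-of-the-envelope arithmetic, not a model of llama.cpp's actual kernel: fp32 scores, a 64x64 tile, and the function names are all assumptions chosen for illustration, and llama.cpp processes prompts in micro-batches, so real working sets will differ.

```python
def naive_score_bytes(n_ctx, n_heads=1, bytes_per_elem=4):
    """Bytes needed to hold the full n_ctx x n_ctx score matrix for each head."""
    return n_ctx * n_ctx * n_heads * bytes_per_elem


def tile_score_bytes(tile_q=64, tile_kv=64, bytes_per_elem=4):
    """Bytes for one tile of scores; independent of context length."""
    return tile_q * tile_kv * bytes_per_elem


MIB = 1024 * 1024
for n_ctx in (2048, 8192, 32768):
    full_mib = naive_score_bytes(n_ctx) / MIB
    tile_kib = tile_score_bytes() / 1024
    print(f"n_ctx={n_ctx:>6}: full scores {full_mib:6.0f} MiB per head, "
          f"one tile {tile_kib:.0f} KiB")
```

Under these assumptions the full matrix grows from 16 MiB per head at 2048 tokens to 4096 MiB per head at 32768 tokens, while the tile stays at 16 KiB and fits in cache, which is the intuition behind the reduced DRAM traffic.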
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:50:05.985024+00:00— report_created — created