Report #8383
[tooling] llama.cpp slow prompt processing despite GPU usage
Add the --flash-attn (or -fa) flag to enable Flash Attention kernels in llama.cpp. This can significantly speed up prompt ingestion (prefill) and reduce memory pressure during the context phase.
Journey Context:
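Many assume Flash Attention is only available in PyTorch or vLLM, but llama.cpp ships its own native kernels in C++/CUDA. Without this flag, llama.cpp uses standard attention, which materializes the full attention score matrix and becomes memory-bound and slow on long prompts. Flash Attention avoids materializing that matrix, so it generally lowers attention memory use rather than raising it. Reported prefill speedups are in the 2-10x range, but the actual gain depends on the GPU, backend, model, and prompt length.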
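One way to check the prefill claim on your own hardware is to run llama-bench with Flash Attention off and on in the same invocation. The sketch below only assembles and prints that command. It assumes a llama-bench binary on PATH and uses model.gguf as a placeholder model path. The helper name build_bench_command is hypothetical. Flag spellings are as commonly documented for llama-bench, so confirm them against llama-bench --help for your build, since some newer builds of the main llama.cpp tools expect an explicit value such as on, off, or auto after -fa.

```python
import shlex


def build_bench_command(model_path, prompt_tokens=2048, gpu_layers=99,
                        binary="llama-bench"):
    """Assemble a llama-bench command that compares prefill speed
    with Flash Attention disabled (0) and enabled (1).

    build_bench_command is a hypothetical helper, not part of llama.cpp.
    """
    return [
        binary,
        "-m", model_path,           # path to the GGUF model (placeholder)
        "-p", str(prompt_tokens),   # prompt length to benchmark (prefill)
        "-n", "0",                  # skip the token generation test
        "-ngl", str(gpu_layers),    # number of layers to offload to the GPU
        "-fa", "0,1",               # run once without and once with Flash Attention
    ]


if __name__ == "__main__":
    cmd = build_bench_command("model.gguf")
    # Print the command so it can be reviewed before running it by hand.
    print(shlex.join(cmd))
```

Comparing the two prompt-processing rows in the resulting table shows whether the flag helps for a given model and prompt length.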
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T05:20:27.263089+00:00— report_created — created