Report #5267
[tooling] llama.cpp: slow prompt processing on long contexts despite a fast GPU
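Pass the --flash-attn flag (short form -fa) when running llama.cpp on Ampere (sm80) or newer GPUs. It is a runtime option, not a build option. It switches attention to fused FlashAttention-style kernels, which avoid writing the full attention score matrix to VRAM and so reduce memory bandwidth pressure during prompt ingestion. Check --help for your build, since the flag's exact form has changed across versions.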
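One way to check whether the flag helps on a given machine is to benchmark prompt processing with flash attention off and on. The sketch below only builds the command line and prints it; it assumes a llama-bench binary on PATH and the -m, -p, -n, -ngl and -fa options as they appear in llama-bench's help output. The model path and the default values are placeholders, and option syntax may differ in your build.

```python
import shlex


def build_bench_command(model_path, prompt_tokens=32768, gpu_layers=99):
    """Return an argv list for a llama-bench run comparing flash attention off and on.

    Assumptions: llama-bench accepts comma-separated values for -fa, and
    -n 0 skips the token-generation test so only prompt processing is timed.
    """
    return [
        "llama-bench",
        "-m", model_path,            # path to a GGUF model (placeholder)
        "-p", str(prompt_tokens),    # prompt length to ingest
        "-n", "0",                   # no generation test
        "-ngl", str(gpu_layers),     # layers to offload to the GPU
        "-fa", "0,1",                # run once without and once with flash attention
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_bench_command("model.gguf")))
```

Comparing the two prompt-processing rows in the benchmark output shows the speedup, if any, on that hardware.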
Journey Context:
Most users assume llama.cpp automatically uses optimal kernels. However, in the builds this report covers, FlashAttention requires explicit opt-in via --flash-attn because it changes how the KV cache is stored (in llama.cpp, the V cache is kept non-transposed when flash attention is enabled). Without this flag, long-context processing (e.g., 32k tokens) becomes memory-bandwidth bound on the attention intermediates and the KV cache, and in this reporter's experience often reaches only 20-30% of theoretical GPU utilization. VRAM use during attention generally goes down rather than up with the flag, because the full score matrix is never materialized. The real tradeoff is that kernel support and speed vary by backend and GPU, so measure on your own hardware. The reported speedup on long contexts is roughly 2-3x. Many tutorials miss this because they focus on quantization rather than on attention kernels and memory layout.
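To see why unfused attention becomes bandwidth-bound at long context, it helps to estimate the size of the intermediate attention score matrix that has to be written to and read back from VRAM. The sketch below is a back-of-the-envelope calculation, not a measurement. The head count (32), batch size (512 query tokens), and 4-byte float scores are illustrative assumptions, not values from this report.

```python
def score_matrix_bytes(n_ctx, n_batch, n_heads, bytes_per_elem=4):
    """Bytes needed to hold the query-by-key score matrix for one layer.

    One score per (query token, key token, head). All parameters here are
    illustrative assumptions for a rough estimate.
    """
    return n_batch * n_ctx * n_heads * bytes_per_elem


if __name__ == "__main__":
    # 512 query tokens attending to a 32k-token context across 32 heads.
    size = score_matrix_bytes(n_ctx=32768, n_batch=512, n_heads=32)
    print(f"{size / 2**30:.2f} GiB per layer per batch")  # prints 2.00 GiB per layer per batch
```

Under these assumptions each layer moves about 2 GiB of scores per batch when the matrix is materialized. Fused kernels keep those scores in on-chip memory instead.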
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T20:56:40.478004+00:00— report_created — created