Report #10690
[tooling] llama.cpp slow prompt processing on long contexts \(10k\+ tokens\)
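Add the \`-fa\` \(\`--flash-attn\`\) flag to enable llama.cpp's Flash Attention kernels, and combine it with \`-ngl 999\` to offload all layers to the GPU. Recent builds may expect an explicit value \(\`-fa on\`\) or enable the feature automatically, so check \`--help\` for your version. Attention compute stays O\(n²\) in context length, but the fused kernel avoids materializing the full score matrix, cutting attention memory traffic from O\(n²\) to roughly O\(n\). That reduction is what speeds up prompt processing on long sequences.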
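One way to check whether the flag helps on a given machine is to benchmark prompt processing with Flash Attention off and on. The sketch below only builds the command line; it assumes the `llama-bench` tool that ships with llama.cpp is on the PATH and accepts `-fa 0,1` to run both settings, which may differ between versions. The helper name `llama_bench_cmd` and the `model.gguf` path are hypothetical.

```python
import shlex


def llama_bench_cmd(model_path, prompt_tokens=8192, gpu_layers=999, binary="llama-bench"):
    """Build an argv list that benchmarks prompt processing with Flash Attention off and on.

    Assumption: the installed llama-bench accepts these flags; confirm with `llama-bench --help`.
    """
    return [
        binary,
        "-m", model_path,
        "-p", str(prompt_tokens),  # size of the prompt-processing test, in tokens
        "-n", "0",                 # skip the token-generation test
        "-ngl", str(gpu_layers),   # offload (up to) this many layers to the GPU
        "-fa", "0,1",              # one run without Flash Attention, one with
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(llama_bench_cmd("model.gguf")))
```

Printing rather than executing keeps the example safe to run anywhere; pass the list to `subprocess.run` once the flags have been checked against the local build.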
Journey Context:
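Without Flash Attention, llama.cpp computes attention in separate steps and writes the intermediate score matrix to memory, so long-context prompts \(RAG, code analysis\) bottleneck on memory bandwidth. Many users know about \`-ngl\` \(GPU layers\) but miss \`-fa\`, assuming it is automatic. Flash Attention tiles the computation so that scores stay in on-chip memory, which reduces reads and writes to GPU memory \(HBM\). This report claims a 2-10x speedup on 8k\+ contexts, and the fused kernel typically needs less working memory rather than more. The flag requires a backend with Flash Attention kernels \(CUDA and Metal are the ones named here\) and enough VRAM for the offloaded layers. It is usually safe to enable, but on a backend or model without kernel support it may be ignored or run slower, so benchmark both settings before relying on it.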
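The reordering described above rests on a streaming ("online") softmax: keys and values are consumed block by block while only a running maximum, a running sum, and a partial output are kept. The following is a minimal pure-Python sketch of that idea for a single query vector, not llama.cpp's actual kernels; the function names and the `block` parameter are illustrative, and it assumes at least one key/value pair.

```python
import math


def naive_attention(q, keys, values):
    """Reference version: materialize every score, then softmax, then weight the values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def streaming_attention(q, keys, values, block=2):
    """Same result, but scores exist only one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions perform the same number of multiply-adds, which is why total compute stays quadratic in context length; the streaming version simply never holds more than `block` scores at once, and that is the memory-traffic saving a fused GPU kernel exploits.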
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T11:21:09.931843+00:00— report_created — created