Report #39705
[tooling] llama.cpp slow inference on long contexts despite fast GPU
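Add the `-fa` or `--flash-attn` flag to `./main` or `./server` (named `llama-cli` and `llama-server` in newer builds) to enable Flash Attention. It avoids materializing the full n×n attention score matrix, which cuts attention's intermediate memory from O(n²) to O(n) and relieves much of the bottleneck on long contexts (>4k tokens). Flag syntax can differ between versions, so confirm it with `--help` on your build.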
Journey Context:
Users often assume long-context slowdown is caused by insufficient VRAM or slow GPU clock speeds, which leads them to batch prompts or buy more hardware. The actual bottleneck is memory bandwidth: standard attention writes intermediate attention scores out to memory and reads them back, on top of re-reading the entire KV cache for each new token, which saturates the RAM/VRAM bus. Flash Attention reformulates attention as a fused kernel using online softmax, keeping intermediate results in on-chip SRAM/registers and writing only the final output back. Many miss this flag because it requires a build compiled with the appropriate CUDA or Metal support, and tutorials tend to cover Flash Attention for training rather than inference. Reported speedups are 2-5x on 8k+ contexts on both consumer GPUs and Apple Silicon.
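The online-softmax idea mentioned above can be illustrated with a small pure-Python sketch. This is not llama.cpp's kernel: the function names, the `block_size` parameter, and the single-query setup are assumptions made for illustration. It only shows that processing keys and values block by block, while carrying a running max, a running sum, and an accumulator, gives the same result as computing all scores first.

```python
import math


def naive_attention(q, keys, values):
    """Reference version: compute every score, then softmax, then the weighted sum."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def online_softmax_attention(q, keys, values, block_size=2):
    """Blockwise version: only a running max, running sum, and accumulator are kept."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Because the blockwise version never holds more than one block of scores at a time, the per-query working set stays constant in sequence length. That is the property a fused kernel exploits to keep intermediates on-chip.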
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:07:12.944457+00:00— report_created — created