Report #16326
[tooling] llama.cpp inference slows dramatically or OOMs when context exceeds 4k-8k tokens despite modern GPU
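Build llama.cpp with Flash Attention support and enable it at run time with --flash-attn (short form -fa), for example: ./main -m model.gguf -fa -c 16384. The report names LLAMA_FLASH_ATTN=ON and -DGGML_CUDA_FLASH_ATTN=ON as the build options. Build option names and the binary name (./main in older builds, llama-cli in newer ones) have changed across llama.cpp versions, so confirm both against the build documentation and --help output for your checkout. Flash Attention computes attention in blocks using an online softmax, so the full N×N attention matrix is never materialized in VRAM. This reduces the memory needed for attention scores from O(N²) to O(N) in the sequence length N, and the report cites a 2-3x speedup on long sequences (>4k tokens).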
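The block-wise online softmax described above can be illustrated with a minimal pure-Python sketch for a single query vector. This is not llama.cpp's kernel: the function names, the block_size parameter, and the list-of-lists layout are assumptions made for illustration only. It shows that keeping a running maximum, a running normalizer, and a running weighted sum gives the same result as computing every score first.

```python
import math


def score(q, k):
    """Scaled dot product between one query and one key."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def naive_attention(q, keys, values):
    """Reference version: compute every score, then apply softmax once."""
    scores = [score(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Hypothetical tiled version: visit keys/values one block at a time.

    Only a running max, a running normalizer, and a running weighted sum
    are kept, so the full row of scores is never stored at once.
    """
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        block_k = keys[start:start + block_size]
        block_v = values[start:start + block_size]
        scores = [score(q, k) for k in block_k]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, block_v):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Calling tiled_attention and naive_attention on the same inputs should agree to within floating-point rounding for any block size. A real GPU kernel applies the same bookkeeping to tiles of queries and keys held in on-chip memory.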
Journey Context:
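Users often accept that "long context is slow" or blame model quantization, without realizing that the standard attention implementation explicitly creates the full attention matrix (batch * heads * seq_len * seq_len) in memory. At 8k context (8192 tokens) that is about 67 million entries per head (2^26, the "64M" of the original report in binary units); at 16k it is about 268 million (2^28). According to the report, this causes cache thrashing and immediate OOM on mid-range GPUs. Flash Attention reformulates the computation using tiling and online softmax, computing the scores in SRAM-sized chunks so the matrix is never stored. The catch, as reported: it needs dedicated kernel support (hence the build-time option LLAMA_FLASH_ATTN named in the report), and at the time of the report it benefited only CUDA builds, not ROCm or CPU. Backend coverage has changed between llama.cpp releases, so verify it for your build. Many users download prebuilt binaries without Flash Attention enabled and miss the feature entirely. Common mistakes: enabling -fa on CPU builds where it is ignored, or using it with very small contexts (<2k tokens), where the overhead of the tiled algorithm outweighs the benefit (neutral or slightly slower). Flash Attention also interacts with KV cache quantization (--cache-type-k / --cache-type-v). The original report states that Flash Attention requires an FP16 (unquantized) KV cache and cannot be combined with a quantized one. This does not match llama.cpp builds in which a quantized V cache is accepted only when Flash Attention is enabled, so check the behavior of your version before combining the two.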
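The entry counts above follow directly from the batch * heads * seq_len * seq_len formula. The sketch below works through the arithmetic. The function names are hypothetical, and the default of 2 bytes per entry (FP16 scores) is an assumption, since some implementations keep scores in FP32, which would double the figures.

```python
def attention_matrix_entries(seq_len, heads=1, batch=1):
    """Number of score entries when the full matrix is materialized."""
    return batch * heads * seq_len * seq_len


def attention_matrix_bytes(seq_len, heads=1, batch=1, bytes_per_entry=2):
    """Memory for the scores; bytes_per_entry=2 assumes FP16 (an assumption)."""
    return attention_matrix_entries(seq_len, heads, batch) * bytes_per_entry


if __name__ == "__main__":
    for n in (4096, 8192, 16384):
        entries = attention_matrix_entries(n)
        mib = attention_matrix_bytes(n) / 2**20
        print(f"{n:>6} tokens: {entries:>12,} entries per head, "
              f"{mib:,.0f} MiB per head at FP16")
```

For 8192 tokens this gives 67,108,864 entries per head, and with a hypothetical 32 heads the FP16 scores alone would take 4 GiB. Doubling the context to 16384 quadruples both numbers, which is the quadratic growth the report describes.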
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T02:22:26.640110+00:00— report_created — created