Report #16149
[tooling] llama.cpp performance degrades quadratically with context length >4k tokens, or uses excessive VRAM, on modern GPUs (Ampere/Ada)
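Add the --flash-attn (or -fa) flag to llama-server or llama-cli. This enables llama.cpp's FlashAttention kernels, which reduce attention memory from O(n²) to O(n) in sequence length and improve throughput by roughly 20-40% on sequences longer than 4096 tokens, provided the GPU has Tensor Cores (compute capability ≥ 7.0, i.e. Volta or newer). Newer builds may expect an explicit value (for example --flash-attn on), so check --help for your version.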
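A minimal sketch of the fix, written as a small Python helper that assembles the llama-server command line. The helper name build_llama_server_cmd and the model path are hypothetical; -m, -c and --flash-attn are the only llama.cpp options used, and the bare --flash-attn form assumes a build that accepts the flag without a value.

```python
import shlex


def build_llama_server_cmd(model_path, ctx_size=8192, flash_attn=True):
    """Return the argv list for launching llama-server (hypothetical helper)."""
    cmd = ["llama-server", "-m", model_path, "-c", str(ctx_size)]
    if flash_attn:
        # Assumption: this build accepts a bare --flash-attn.
        # Some newer builds may want a value after the flag instead.
        cmd.append("--flash-attn")
    return cmd


# Hypothetical model path, for illustration only.
cmd = build_llama_server_cmd("models/example.gguf")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd) once the model path points at a real GGUF file.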
Journey Context:
A standard attention implementation materializes the full N×N attention score matrix in VRAM, so attention memory grows quadratically with context length. FlashAttention processes the scores in tiles with a running (online) softmax, so the full matrix is never written out and the working set stays in on-chip SRAM and registers; recomputation of attention weights is only needed for the backward pass during training. Without --flash-attn, llama.cpp falls back to its standard attention kernels, which are naive or only partially optimized. Many users don't know this flag exists because it is relatively new (mid-2024) and was not made the default, for backward compatibility with older GPUs. Tradeoffs: slightly higher register pressure, and the gains described here apply to GPU backends (CUDA/Metal), not CPU. On short contexts (<1k tokens) it can add overhead.
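To illustrate the tiling idea, here is a pure-Python sketch of attention for a single query, computed once with all scores held at the same time and once block by block with an online softmax. It is a toy model of the technique, not llama.cpp's kernel code; the function names, the block size, and the sample data are all made up for the example.

```python
import math


def naive_attention(q, keys, values):
    """One query: compute every score first, then normalize."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def tiled_attention(q, keys, values, block=2):
    """Same result, but only `block` scores exist at any moment."""
    scale = 1.0 / math.sqrt(len(q))
    acc = [0.0] * len(values[0])   # running weighted sum of values
    running_max = float("-inf")    # running max of scores seen so far
    running_sum = 0.0              # running softmax denominator
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

out_naive = naive_attention(q, keys, values)
out_tiled = tiled_attention(q, keys, values)
max_diff = max(abs(a - b) for a, b in zip(out_naive, out_tiled))
print(max_diff)
```

With N queries, the naive path holds N×N scores while the tiled path holds only one block of scores per query at a time, which is where the O(n²) to O(n) memory reduction comes from.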
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T01:54:29.729449+00:00— report_created — created