Report #43945
[tooling] Suboptimal throughput on Ampere/Ada GPUs with long contexts despite sufficient VRAM
Enable the Flash Attention kernels by adding `-fa` or `--flash-attn` to the command line. This reduces memory-bandwidth pressure and can increase throughput by 20-40% on sequences longer than 2k tokens (requires CUDA compute capability ≥8.0).
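A minimal sketch of gating the flag on the compute-capability threshold quoted above, so the same launch script can run on older GPUs. The helper names, the `llama-cli` binary name, and the model path are illustrative assumptions, not part of this report; the flag spelling follows the report and may differ between llama.cpp builds, so check `--help` first. The `compute_cap` query field is only exposed by reasonably recent `nvidia-smi` versions.

```python
import subprocess

# Threshold as stated in this report (compute capability >= 8.0).
MIN_CC = (8, 0)


def parse_compute_cap(text):
    """Parse a string such as '8.6' into the tuple (8, 6)."""
    major, _, minor = text.strip().partition(".")
    return int(major), int(minor or 0)


def supports_flash_attn(compute_cap, minimum=MIN_CC):
    """Return True when the reported capability meets the threshold."""
    return parse_compute_cap(compute_cap) >= minimum


def build_args(model_path, ctx_size, compute_cap):
    """Hypothetical helper: build an argument list, adding -fa only if supported."""
    # Binary name and flags other than -fa are assumptions for illustration.
    args = ["llama-cli", "-m", model_path, "-c", str(ctx_size)]
    if supports_flash_attn(compute_cap):
        args.append("-fa")
    return args


def query_compute_cap():
    """Ask nvidia-smi for the first GPU's compute capability (needs a recent driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()[0]
```

For example, `build_args("model.gguf", 4096, query_compute_cap())` yields a list that can be passed to `subprocess.run`; on a GPU below the threshold the `-fa` flag is simply left out.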
Journey Context:
Users often assume that if a model fits in VRAM, inference speed is purely compute-bound. For long contexts, however, transformer attention is memory-bandwidth bound because of the quadratic memory access pattern of the attention mechanism. Standard attention implementations materialize the full N×N attention matrix in HBM (high-bandwidth memory). Flash Attention reformulates the computation with tiling and recomputation so the full matrix is never materialized, which significantly reduces HBM reads and writes. In llama.cpp this is available via the `-fa` flag, but it is not enabled by default because it requires specific GPU architecture support (Ampere/Ada/Hopper) and specific CUDA toolkit versions. Users on older GPUs (Turing/Pascal) will get errors or fallbacks if they try to use it. The speedup grows with context length; at 4k+ tokens it can be the difference between real-time and unusable latency. Many users miss this because they focus on quantization levels rather than attention kernel optimizations.
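To make the tiling idea concrete, here is a small pure-Python sketch for a single query vector. It is not llama.cpp's kernel and not a GPU implementation; it only illustrates the online-softmax bookkeeping that lets keys and values be processed tile by tile without ever holding every score at once. All function names and the fp16 element size (2 bytes) are assumptions made for illustration.

```python
import math


def score_matrix_bytes(n_tokens, bytes_per_elem=2):
    """Size of one head's full N x N score matrix (fp16 assumed by default)."""
    return n_tokens * n_tokens * bytes_per_elem


def naive_attention(q, keys, values):
    """Reference version: compute every score first, then apply softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    """Tiled version: keep only a running max, running sum, and accumulator."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (exp(-inf) is 0.0, so the first tile starts from zero).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same vector up to floating-point rounding, but the tiled one never stores more than `tile` scores at a time. On the memory side, `score_matrix_bytes(4096)` is 33,554,432 bytes (32 MiB) for a single head under the fp16 assumption, and that figure quadruples each time the context length doubles.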
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:14:03.879414+00:00— report_created — created