Report #83022
[tooling] llama.cpp slow on 8k+ context with 70B models despite having enough VRAM
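Add the `-fa` (`--flash-attn`) flag to llama.cpp commands (server/main). The report lists compute capability 7.5+ as the CUDA requirement and macOS 13+ as the Metal requirement. Flash Attention avoids writing the full attention-score matrix to VRAM, so attention scratch memory grows as O(n) rather than O(n²) in context length; the KV-cache itself is O(n) either way. The reported effect is a 30-50% cut in 8k-context inference time.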
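As a rough illustration of the fix above, the sketch below assembles a llama.cpp command line with Flash Attention enabled. The helper name `build_llama_cmd`, the model path, and the default values are hypothetical; the binary name `llama-server` assumes a recent build (older builds ship it as `server`). Some newer builds appear to expect an explicit value after the flag (for example `on`), so check your build's `--help` output before relying on the bare form.

```python
import shlex


def build_llama_cmd(model_path, ctx_size=8192, n_gpu_layers=99,
                    binary="llama-server", flash_attn_value=None):
    """Return an argv list that enables Flash Attention.

    flash_attn_value=None emits the bare flag as described in the report;
    pass a value such as "on" if your build's --help shows the flag
    taking an argument (an assumption to verify locally).
    """
    cmd = [
        binary,
        "-m", str(model_path),       # path to the GGUF model (hypothetical here)
        "-c", str(ctx_size),         # context window in tokens
        "-ngl", str(n_gpu_layers),   # number of layers to offload to the GPU
        "--flash-attn",              # long form of -fa
    ]
    if flash_attn_value is not None:
        cmd.append(flash_attn_value)
    return cmd


if __name__ == "__main__":
    # Print the command for inspection instead of executing it.
    print(shlex.join(build_llama_cmd("models/llama-70b.Q4_K_M.gguf")))
```

Printing rather than running the command keeps the sketch safe to try; pass the returned list to `subprocess.run` only after confirming the flags against your installed version.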
Journey Context:
Many users assume slow long-context inference is due to model size alone, but a major bottleneck is memory bandwidth during attention computation. Standard attention reads the entire KV-cache for each token and also writes the intermediate attention scores out to VRAM, so the total work over an n-token sequence scales quadratically. Flash Attention uses kernel fusion and tiling to keep those intermediates in on-chip SRAM, minimizing HBM/VRAM reads and writes. The tradeoff is slightly higher compute (acceptable on GPU) and specific head-dimension requirements (must be divisible by 8 on CUDA). Without this flag, even A100s choke on 32k contexts; with it, consumer 4090s handle 16k smoothly.
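To put rough numbers on the memory traffic described above, here is a back-of-the-envelope sketch. The model shape (80 layers, 64 query heads, 8 KV heads, head dimension 128) is an assumption matching a Llama-2-70B-style architecture, and the fp16 cache, fp32 scores, and 512-token batch are likewise assumptions, not values taken from this report.

```python
GIB = 1024 ** 3

# Assumed Llama-2-70B-style shape (not taken from the report).
N_LAYERS, N_HEADS, N_KV_HEADS, HEAD_DIM = 80, 64, 8, 128


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    # K and V each hold n_kv_heads * head_dim values per token per layer,
    # so the cache grows linearly with context length.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def score_tensor_bytes(n_heads, n_batch, n_ctx, bytes_per_elem=4):
    # Without a fused kernel, one layer materializes a score tensor of
    # shape (n_heads, n_batch, n_ctx) in VRAM for each batch it processes.
    return n_heads * n_batch * n_ctx * bytes_per_elem


if __name__ == "__main__":
    for n_ctx in (8192, 16384, 32768):
        kv = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, n_ctx)
        sc = score_tensor_bytes(N_HEADS, 512, n_ctx)
        print(f"n_ctx={n_ctx:6d}  KV cache={kv / GIB:5.2f} GiB  "
              f"scores per layer per batch={sc / GIB:5.2f} GiB")
```

Under these assumptions, an 8192-token context gives a 2.5 GiB KV-cache and a 1 GiB score tensor per layer for each 512-token batch. The score tensor is the part a fused Flash Attention kernel keeps on-chip instead of writing to VRAM.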
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:56:34.631124+00:00— report_created — created