Report #92488
[tooling] llama.cpp slow inference and high VRAM usage on contexts >4k tokens
Compile with `LLAMA_FLASH_ATTN=ON` (or use recent prebuilt binaries) and add the `--flash-attn` flag at runtime. This reduces the memory overhead of the attention scores from O(n²) to O(n) for long sequences.
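To make the O(n²) versus O(n) claim concrete, here is a rough back-of-the-envelope sketch in Python. It is an idealized estimate, not a measurement of llama.cpp: the function name `attention_scratch_bytes`, the 32-head count, the 16-bit element size, and the assumption that the whole prompt is processed in a single batch are all illustrative choices, and real allocations will differ.

```python
def attention_scratch_bytes(n_ctx, n_heads=32, bytes_per_elem=2, flash=False):
    """Idealized scratch memory (bytes) for one layer's attention over n_ctx tokens.

    Assumptions (not taken from llama.cpp): the whole prompt is processed in one
    batch, there are 32 heads, and elements are 16-bit.
    """
    if flash:
        # Tiled path: only a running max and a running sum per query row are kept.
        return n_heads * n_ctx * 2 * bytes_per_elem
    # Naive path: a full n_ctx x n_ctx score matrix per head.
    return n_heads * n_ctx * n_ctx * bytes_per_elem


for n_ctx in (4096, 8192, 16384):
    naive = attention_scratch_bytes(n_ctx)
    tiled = attention_scratch_bytes(n_ctx, flash=True)
    print(f"{n_ctx:>6} tokens: naive {naive / 2**30:.1f} GiB, tiled {tiled / 2**20:.1f} MiB")
```

Under these assumptions the naive figure quadruples each time the context doubles (1 GiB at 4k, 4 GiB at 8k), while the tiled figure only doubles.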
Journey Context:
Without Flash Attention, the full attention score matrix is written to and read back from GPU memory (HBM), so memory traffic grows quadratically with context length and becomes the bottleneck for context windows >4k. Many users compile llama.cpp without this flag or don't know that it is available in mainline. Flash Attention processes the computation in tiles that fit in on-chip SRAM instead of round-tripping through HBM. Tradeoff: it requires CUDA 11.6+ or Metal support and adds compile complexity, but the reported runtime savings are substantial (2-3x speedup at 8k context).
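The tiling idea can be sketched in plain Python for a single query vector. This is a toy illustration of the online-softmax technique that Flash Attention builds on, not llama.cpp's kernel: the function names, the tile size, and the list-based vectors are assumptions made for readability. The tiled version never holds more than one tile of scores at a time, yet produces the same output as the naive version.

```python
import math


def _score(q, k):
    # Scaled dot product between one query and one key.
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def naive_attention(q, keys, values):
    # Materializes every score before applying the softmax.
    scores = [_score(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    # Processes keys/values one tile at a time, keeping only a running max,
    # a running sum, and a running weighted accumulator.
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        scores = [_score(q, k) for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first tile).
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Calling both functions on the same inputs should give results that agree up to floating-point rounding, regardless of the tile size. In a real GPU kernel the tile is sized to fit in on-chip SRAM, which is where the bandwidth savings come from.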
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:49:52.688607+00:00 - report_created - created