Report #62653
[tooling] Long-context inference (>8k tokens) causes OOM or 10x slowdown on CUDA despite FlashAttention availability
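Compile llama.cpp with GGML_CUDA_ENABLE_FLASH_ATTENTION=ON (flag name as given in this report; confirm it against the CMake options of your llama.cpp version) and run with the --flash-attn flag to enable the FlashAttention-2 backend. This avoids materializing the attention score matrix, cutting attention working memory from O(n²) to O(n) in sequence length. The KV cache itself is already O(n) and does not shrink.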
Journey Context:
Standard attention implementations materialize the full N×N attention score matrix, so both memory use and memory traffic during the softmax grow as O(N²). For contexts >8k this comes on top of an already large KV cache: a 70B model with 32k context needs ~80GB for the KV cache alone (assuming 16-bit values, 80 layers, hidden size 8192, and no grouped-query attention), and that cache grows linearly with context whichever attention kernel is used. FlashAttention-2 uses tiling and an online softmax to compute attention block by block in on-chip SRAM without materializing the full matrix, so the attention working memory scales linearly instead of quadratically. According to this report, llama.cpp requires both the compile-time flag GGML_CUDA_ENABLE_FLASH_ATTENTION and the runtime flag --flash-attn. Without both, even recent builds fall back to naive attention. This is critical for RAG applications with 128k context windows on local hardware.
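The difference between KV cache size (linear in context) and a materialized score matrix (quadratic in context) can be checked with quick arithmetic. The sketch below is illustrative only: the model shape (80 layers, 64 heads, head dimension 128, 16-bit values) is an assumed 70B-class configuration, the function names are hypothetical, and the score-matrix figure is a worst case that assumes the whole prompt is processed in one pass for one layer. Real runtimes usually process the prompt in smaller batches, which lowers that peak.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Bytes for the K and V caches across all layers (linear in n_tokens)."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


def score_matrix_bytes(n_heads, n_tokens, bytes_per_elem=2):
    """Bytes for one layer's fully materialized N x N scores, all heads
    (quadratic in n_tokens). This is what a fused, tiled kernel avoids."""
    return n_heads * n_tokens * n_tokens * bytes_per_elem


# Assumed 70B-class shape; not taken from any specific model file.
N_LAYERS, N_HEADS, HEAD_DIM = 80, 64, 128
N_KV_HEADS_GQA = 8  # assumed grouped-query attention variant

for n_tokens in (8192, 32768):
    kv_mha = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, n_tokens)
    kv_gqa = kv_cache_bytes(N_LAYERS, N_KV_HEADS_GQA, HEAD_DIM, n_tokens)
    scores = score_matrix_bytes(N_HEADS, n_tokens)
    print(
        f"{n_tokens:>6} tokens: "
        f"KV cache {kv_mha / GIB:.1f} GiB (no GQA), "
        f"{kv_gqa / GIB:.1f} GiB (GQA, 8 KV heads), "
        f"materialized scores {scores / GIB:.1f} GiB per layer"
    )
```

Under these assumptions, the 32k case gives an 80 GiB KV cache without grouped-query attention, 10 GiB with 8 KV heads, and 128 GiB for a fully materialized score matrix in a single layer. Quadrupling the context quadruples the KV cache but multiplies the score matrix by sixteen, which is why removing the materialized matrix matters most at long context.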
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:39:01.737373+00:00— report_created — created