Report #60912
[tooling] llama.cpp high VRAM usage or OOM with long context windows on CUDA/Metal
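Compile with LLAMA_CUDA_FLASH_ATTN=ON (or LLAMA_METAL_FLASH_ATTN=ON) and run with --flash-attn. Build option names have changed between llama.cpp versions, so check them against your checkout. This reduces VRAM use by roughly 20-30% and allows longer contexts without OOM.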
Journey Context:
A standard attention implementation materializes the full N×N attention score matrix in VRAM, so its memory cost grows quadratically with context length N. Flash Attention fuses the attention kernels and processes keys and values in tiles, which cuts HBM (VRAM) traffic and avoids ever materializing the large matrix. In llama.cpp this is not enabled by default, because it requires specific kernel support and a longer compilation. Many users run pre-built binaries without these flags and so miss the memory savings. According to this report, the VRAM savings often allow running 70B models with 8k context on 24GB cards; whether that holds depends heavily on quantization and on how many layers are offloaded to the GPU. Tradeoff: compilation requires a recent CUDA Toolkit (11.8+) or Metal SDK; runtime overhead is negligible.
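To make the quadratic growth and the tiling idea concrete, here is a minimal Python sketch. It is not llama.cpp code: the function names are hypothetical, the 64-head and fp16 figures are assumed example values, and the byte estimate is an upper bound for a naive implementation that holds every score at once, not a measurement of any real build. The second half shows the online-softmax trick that tiled attention kernels rely on, for a single query row.

```python
import math


def naive_scores_bytes(n_ctx, n_heads, bytes_per_elem=2):
    """Bytes for one layer's full n_ctx x n_ctx score matrix across all heads.

    Assumption: every score is held at once (fp16 = 2 bytes by default).
    """
    return n_ctx * n_ctx * n_heads * bytes_per_elem


def naive_attention_row(q, keys, values):
    """Reference: compute all scores for one query, then softmax and mix values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention_row(q, keys, values, tile=2):
    """Same result, but only `tile` scores exist at any time (online softmax)."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale the partial sum and accumulator to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    # Assumed example: 64 heads, fp16 scores, 8192-token context.
    gib = naive_scores_bytes(8192, 64) / 2**30
    print(f"naive score matrix at 8k context: {gib} GiB per layer")  # 8.0
    # Doubling the context quadruples the naive cost.
    print(naive_scores_bytes(16384, 64) // naive_scores_bytes(8192, 64))  # 4
```

The tiled version keeps only a running maximum, a running sum, and one output-sized accumulator per query, so its working memory depends on the tile size rather than on N. That is the property that lets a fused kernel skip the N×N buffer.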
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:43:43.713859+00:00— report_created — created