Report #35171
[tooling] llama.cpp slow or OOM with long contexts (16k+ tokens)
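Enable Flash Attention with --flash-attn (short form -fa). This report also calls for compiling with LLAMA_FLASH_ATTN=1; confirm that this build option exists in your llama.cpp version. Flash Attention avoids materializing the N×N attention score matrix, so attention working memory drops from quadratic O(n²) to linear O(n) in context length. The KV cache itself already grows linearly and is not what changes. This is essential for 32k+ contexts on consumer hardware.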
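To see which buffer is linear and which is quadratic, here is a back-of-the-envelope sketch. The model dimensions (32 layers, 32 heads, head size 128, fp16 cache, float32 scores) are hypothetical placeholders, not taken from this report, and the score estimate assumes a naive implementation that scores the whole prompt at once; real llama.cpp builds process the prompt in batches, so actual numbers will differ.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    # One K vector and one V vector per token, per layer, per KV head.
    # Linear in n_tokens. All dimensions here are assumed, not measured.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

def attn_scores_bytes(n_tokens, n_heads=32, bytes_per_elem=4):
    # Full N x N score matrix for a single layer across all heads,
    # as a naive (non-flash) implementation would materialize it.
    # Quadratic in n_tokens.
    return n_heads * n_tokens * n_tokens * bytes_per_elem

if __name__ == "__main__":
    for n in (4096, 16384, 32768):
        kv = kv_cache_bytes(n) / GIB
        scores = attn_scores_bytes(n) / GIB
        print(f"{n:>6} tokens: KV cache {kv:6.1f} GiB, naive scores {scores:6.1f} GiB")
```

Under these assumptions, doubling the context doubles the KV cache but quadruples the naive score matrix, which is the buffer Flash Attention removes.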
Journey Context:
Without Flash Attention, the attention score buffer grows quadratically with sequence length, which this report found causes OOM at ~16k tokens even on 48GB GPUs. (The KV cache grows only linearly.) Many users mistakenly conclude that they need more VRAM or a smaller model. Flash Attention computes attention tile by tile with a running softmax, so the full N×N attention matrix is never materialized and attention memory becomes linear in context length. The tradeoff is a small amount of extra compute per token (minimal, and often offset by reduced memory traffic) and the need for specific compilation flags. Alternatives such as ring attention or sparse attention exist but, as of this report, are not in mainline llama.cpp. For long-context local LLMs, this is the single most important flag.
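The tiling idea can be illustrated with a minimal pure-Python sketch for a single query vector. This is not llama.cpp's kernel; the function names and the tile size are made up for illustration. It only shows that processing keys and values in tiles with a running (online) softmax gives the same result as computing all scores first.

```python
import math

def attention_full(q, keys, values):
    # Reference version: materializes every score before the softmax.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def attention_tiled(q, keys, values, tile=2):
    # Tiled version: only one tile of scores exists at a time.
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]

if __name__ == "__main__":
    q = [1.0, 0.5]
    keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
    print(attention_full(q, keys, values))
    print(attention_tiled(q, keys, values, tile=2))
```

The tiled version keeps only a running maximum, a running sum, and an accumulator per query, so its extra memory does not grow with the number of keys.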
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:30:49.439157+00:00— report_created — created