Report #28734
[tooling] llama.cpp OOM on long contexts despite KV cache fitting in VRAM
Add the `--flash-attn` (or `-fa`) flag to the llama.cpp server or main CLI. This enables Flash Attention kernels, which compute attention in tiles without materializing the full N×N attention matrix, reducing memory usage from O(N²) to O(N) for the attention computation itself.
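To make the tiling idea concrete, here is a minimal pure-Python sketch of online-softmax attention for a single query vector. It is an illustration of the technique only, not llama.cpp's kernel: the function names, the `tile` parameter, and the sample data are all hypothetical. The tiled version holds only one tile of scores at a time, plus a running max, a running normalizer, and a running weighted sum.

```python
import math


def attention_naive(q, keys, values):
    """Reference version: materialize every score for this query, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def attention_tiled(q, keys, values, tile=2):
    """Online softmax: visit keys tile by tile, never holding all scores at once."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        # Only `tile` scores exist at any moment.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


# Hypothetical sample data: 5 keys/values of width 3.
q = [0.3, -1.2, 0.5]
keys = [[0.1, 0.2, 0.3], [1.0, -0.5, 0.0], [-0.7, 0.4, 2.0],
        [0.0, 0.0, 1.5], [2.0, 1.0, -1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
          [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]

reference = attention_naive(q, keys, values)
tiled = attention_tiled(q, keys, values, tile=2)
max_diff = max(abs(a - b) for a, b in zip(reference, tiled))
print("max abs difference:", max_diff)
```

Both functions should agree up to floating-point rounding for any tile size, which is the property that lets a kernel trade the full score matrix for a small per-tile working set.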
Journey Context:
Users often calculate that the KV cache fits in memory (2×layers×d_model×context×bytes) but still hit OOM errors. That calculation overlooks the fact that standard attention implementations compute the Q×K^T score matrix explicitly, and this matrix grows quadratically with sequence length. Flash Attention instead uses tiling with an online softmax to compute attention block by block, so the full matrix is never stored in HBM. Tradeoff: slightly higher register pressure in exchange for large memory savings. Common mistake: assuming Flash Attention is only for training, or that it is not available for inference.
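The gap between the two estimates can be checked with a few lines of arithmetic. The sketch below uses the report's KV cache formula and a separate estimate for explicitly stored attention scores. The model shape (32 layers, d_model 4096, 32 heads, 32768-token context), the fp16 cache, the f32 scores, and the 128×128 tile are all assumptions chosen for illustration. It also assumes the scores for one layer are live at a time, and that the number of query tokens processed together equals the context length (the N×N case described above); a smaller batch shrinks the score buffer proportionally.

```python
GIB = 2 ** 30


def kv_cache_bytes(n_layers, d_model, n_ctx, bytes_per_elem=2):
    """KV cache size per the report's formula: keys + values for every layer.

    Assumes full multi-head attention (no grouped-query reduction) and
    fp16 storage by default.
    """
    return 2 * n_layers * d_model * n_ctx * bytes_per_elem


def score_bytes(n_heads, n_queries, n_keys, bytes_per_elem=4):
    """Memory for explicitly stored attention scores of one layer.

    One n_queries x n_keys matrix per head, f32 by default.
    """
    return n_heads * n_queries * n_keys * bytes_per_elem


# Hypothetical model shape, for illustration only.
n_layers, d_model, n_heads, n_ctx = 32, 4096, 32, 32768

kv = kv_cache_bytes(n_layers, d_model, n_ctx)
full_scores = score_bytes(n_heads, n_ctx, n_ctx)   # full N x N per head
tile_scores = score_bytes(n_heads, 128, 128)       # assumed 128 x 128 tile

print(f"KV cache:          {kv / GIB:.1f} GiB")
print(f"Full score matrix: {full_scores / GIB:.1f} GiB")
print(f"Per-tile scores:   {tile_scores / 2**20:.1f} MiB")
```

Under these assumptions the score matrix is several times larger than the KV cache itself, while the tiled working set does not depend on context length at all.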
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:37:34.909322+00:00— report_created — created