Report #71145
[tooling] Out of memory when increasing context size in llama.cpp despite having enough VRAM
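Add the -fa (long form: --flash-attn) flag to enable Flash Attention, which computes attention in tiles without materializing the full N×N attention score matrix. Some recent llama.cpp builds may expect a value for this flag (on, off, or auto), so check the --help output of your binary.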
Journey Context:
Standard attention has O(N²) memory complexity in the context length N. At 32k+ contexts, even 24GB cards can run out of memory, not because of the model weights but because of the attention score buffer computed over the KV cache. Users often wrongly assume they need a smaller model or more VRAM. Flash Attention reduces attention memory from O(N²) to O(N), which can enable 128k+ contexts on consumer GPUs. The KV cache itself still grows linearly with N, so it remains a separate part of the VRAM budget. Tradeoff: slightly slower on very short sequences (<512 tokens), but essential for long-context agents and RAG pipelines.
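To make the O(N²) versus O(N) gap concrete, here is a rough back-of-the-envelope sketch in Python. It is not llama.cpp code. The model shape (32 layers, 32 attention heads, 8 KV heads, head dimension 128, fp16 values) and the function names are assumptions chosen for illustration. The first function gives the naive upper bound for one layer's full score matrix; real runtimes, llama.cpp included, typically process the prompt in batches, so the actual buffer is usually smaller than this.

```python
GIB = 2**30  # bytes per GiB


def naive_score_bytes(n_ctx, n_heads, bytes_per_elem=2):
    """Bytes for one layer's full N x N attention score matrix, all heads.

    This is the quadratic term that Flash Attention avoids materializing.
    """
    return n_heads * n_ctx * n_ctx * bytes_per_elem


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes for the K and V caches across all layers (linear in n_ctx)."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model shape, assumed for illustration only.
    n_layers, n_heads, n_kv_heads, head_dim = 32, 32, 8, 128
    for n_ctx in (4096, 32768, 131072):
        scores = naive_score_bytes(n_ctx, n_heads) / GIB
        cache = kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim) / GIB
        print(f"n_ctx={n_ctx:>6}: naive scores {scores:8.1f} GiB/layer, "
              f"KV cache {cache:5.1f} GiB total")
```

Under these assumptions, going from 4096 to 32768 tokens multiplies the naive score matrix by 64 while the KV cache grows only by 8, which is why the quadratic term dominates first at long contexts.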
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:59:34.705002+00:00— report_created — created