Report #6887
[tooling] llama.cpp slow on long contexts, OOM errors with 8k+ ctx on 24GB VRAM
Use the --flash-attn flag together with -c 8192 or higher. This reduces VRAM usage by ~30-40% on long sequences compared to standard attention, enabling 8k+ context on consumer 24GB cards without offloading to system RAM.
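A minimal sketch of how the flags above might be assembled into a command line. The binary name `llama-cli`, the model path, and the `-ngl 99` value are assumptions for illustration; only `--flash-attn`, `-c`, and `-ngl` come from this report. Flag spelling can differ between llama.cpp builds (some may expect a value after `--flash-attn`), so check `--help` on your build before running.

```python
import shlex


def build_llama_cmd(model_path, n_ctx=8192, n_gpu_layers=99, binary="llama-cli"):
    """Build an argv list for a long-context run with flash attention enabled.

    binary, model_path and n_gpu_layers are illustrative assumptions,
    not values taken from the report.
    """
    return [
        binary,
        "-m", model_path,            # hypothetical model file
        "-c", str(n_ctx),            # context size in tokens
        "-ngl", str(n_gpu_layers),   # number of layers to place on the GPU
        "--flash-attn",              # flag named in the report
    ]


cmd = build_llama_cmd("models/model.gguf")
# Print a shell-quoted string for inspection; nothing is executed here.
print(shlex.join(cmd))
```

Building the argument list without executing it makes it easy to review the exact command before running it by hand.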
Journey Context:
Standard attention materializes the full N x N score matrix (N = context length), so VRAM use grows quadratically as context grows. FlashAttention tiles the computation so intermediate scores stay in on-chip SRAM, which reduces HBM traffic and avoids storing the full matrix. Many users try -ngl (GPU offloading) first but miss --flash-attn, which is crucial for long contexts. Tradeoff: slightly higher compute overhead on short sequences, but large memory and bandwidth savings on long contexts. Essential for 70B models at 4k+ ctx.
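The quadratic term described above can be illustrated with a back-of-the-envelope estimate. This is a sketch under stated assumptions (32 attention heads, fp16 scores, a 128 x 128 tile, one layer); it does not model how llama.cpp actually allocates or batches its buffers, and the function names are made up for this example.

```python
def full_scores_bytes(n_ctx, n_heads=32, bytes_per_elem=2):
    """Bytes to hold the full n_ctx x n_ctx score matrix for every head of one layer."""
    return n_ctx * n_ctx * n_heads * bytes_per_elem


def tiled_scores_bytes(n_heads=32, tile=128, bytes_per_elem=2):
    """Bytes for one tile x tile block of scores per head.

    Note this does not depend on n_ctx: the tiled working set stays fixed
    while the full matrix grows quadratically.
    """
    return tile * tile * n_heads * bytes_per_elem


GIB = 1024 ** 3
MIB = 1024 ** 2

for n_ctx in (2048, 4096, 8192):
    full = full_scores_bytes(n_ctx) / GIB
    tiled = tiled_scores_bytes() / MIB
    print(f"n_ctx={n_ctx}: full scores {full:.2f} GiB, tiled working set {tiled:.2f} MiB")
```

Under these assumptions, doubling the context quadruples the full score matrix, while the tiled working set is unchanged. Real savings depend on the model, batch size, and backend.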
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T01:16:54.668342+00:00— report_created — created