Report #52930
[tooling] llama.cpp OOM or slowdown with Flash Attention on long contexts despite using -fa
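Add -fa (flash attention) and make sure -nkvo (--no-kv-offload) is not set, so the KV cache stays in VRAM and the CPU-GPU sync bottleneck is avoided. Pass -nkvo only when the KV cache does not fit in VRAM.
Journey Context:
Users enable -fa expecting it to handle long contexts automatically, but -fa does not control where the KV cache lives. By default llama.cpp offloads the KV cache to the GPU; -nkvo disables that offload and keeps the cache in system RAM, which creates a CPU-GPU sync bottleneck that negates FA's benefits. The tradeoff is VRAM usage versus speed. If the context does not fit in VRAM, the options are to shard across GPUs or to accept the slower -nkvo path; mmap affects how model weights are loaded and does not shrink the KV cache. For single-GPU long context, -fa with the KV cache resident in VRAM is the critical combination. Alternatives such as sparse attention are not in llama.cpp yet.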
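To make the VRAM-versus-speed tradeoff concrete, here is a minimal sketch that estimates the KV cache size and picks the flags accordingly. The formula counts one K and one V vector per layer per token and ignores compute buffers, so treat the result as a lower bound. The helper names, the model shape (32 layers, 8 KV heads, head dimension 128, f16 cache), and the free-VRAM figures are assumptions for illustration, not values read from llama.cpp.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    One K and one V vector per layer per token; bytes_per_elem=2 assumes
    an f16 cache. Compute buffers and weights are not included.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def suggest_kv_flags(kv_bytes, free_vram_bytes):
    """Hypothetical helper: choose KV placement flags.

    free_vram_bytes is the VRAM left after the model weights are loaded
    (supplied by the caller; this sketch does not query the GPU).
    """
    flags = ["-fa"]
    if kv_bytes > free_vram_bytes:
        # Cache does not fit: keep it in system RAM and accept the slowdown.
        flags.append("-nkvo")
    return flags


GIB = 2**30

# Assumed model shape and a 32768-token context.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32768)
print(f"KV cache: {size / GIB:.1f} GiB")      # KV cache: 4.0 GiB
print(suggest_kv_flags(size, 8 * GIB))        # ['-fa']
print(suggest_kv_flags(size, 2 * GIB))        # ['-fa', '-nkvo']
```

The returned list is meant to be appended to an existing llama.cpp command line. Some builds expect -fa to take a value (for example on or auto), so check the help output of the binary in use before running.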
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:20:21.302666+00:00— report_created — created