Report #13137
[tooling] llama.cpp slow on long contexts despite full GPU offload
Enable Flash Attention (-fa) together with a quantized KV cache (--cache-type-k q8_0 --cache-type-v q8_0) to reduce memory bandwidth and avoid CPU fallback for attention kernels.
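As a minimal sketch of the fix above, the following assembles a command line without running it. The model path, context size, and the helper name `build_llama_args` are hypothetical. The `--cache-type-v` flag is added as an assumption, since the report names only `--cache-type-k`. Some builds expect a value after `-fa` (for example `on`), so the flag is passed as a tuple that can be swapped; check `--help` on your build.

```python
import shlex

def build_llama_args(model_path, ctx_size=32768, cache_type="q8_0",
                     flash_attn_args=("-fa",), binary="llama-server"):
    """Assemble an argv list for llama.cpp; nothing is executed here."""
    return [
        binary,
        "-m", model_path,               # hypothetical model path
        "-ngl", "999",                  # offload all layers, as in the report
        "-c", str(ctx_size),            # context length in tokens
        *flash_attn_args,               # Flash Attention opt-in
        "--cache-type-k", cache_type,   # quantize the K cache
        "--cache-type-v", cache_type,   # quantize the V cache (assumption, see lead-in)
    ]

args = build_llama_args("models/example-model.gguf")
print(shlex.join(args))  # copy the printed line into a shell to try it
```

For a build that wants an explicit value, call `build_llama_args(path, flash_attn_args=("-fa", "on"))`.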
Journey Context:
Users enable full GPU offload (-ngl 999) but still see slowdowns past roughly 4k tokens. Standard attention becomes memory-bound at that point and, according to this report, llama.cpp can fall back to CPU kernels for the attention computation when Flash Attention is off. The -fa flag enables the fused Flash Attention kernel, which needs explicit opt-in on builds where it is not the default. In addition, the KV cache defaults to fp16, and every generated token reads the whole cache, so its size directly drives memory traffic. Quantizing the KV cache to Q8_0 roughly halves that traffic with negligible perplexity impact; Q4_0, for extreme cases, roughly quarters it but is more likely to cost quality. The two settings belong together: llama.cpp only allows a quantized V cache when Flash Attention is enabled. The combination can make 128k context usable on consumer GPUs without grinding to a halt, provided the weights and cache fit in VRAM. Note: -fa requires backend support (CUDA/Metal/ROCm).
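To put numbers on the cache-size claim, here is a rough sketch of the arithmetic. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical example, not taken from the report. The block sizes are ggml's storage layouts: Q8_0 stores 32 values in 34 bytes and Q4_0 stores 32 values in 18 bytes, so the savings are slightly less than an exact half or quarter.

```python
BLOCK_ELEMS = 32                                    # values per quantization block
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q4_0": 18}   # bytes per 32-value block

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Approximate size of the K and V caches combined, in bytes."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # factor 2: K plus V
    return elems // BLOCK_ELEMS * BLOCK_BYTES[cache_type]

# Hypothetical model shape at 128k context
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=131072)
for cache_type in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(cache_type=cache_type, **shape) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB")
# f16: 16.00 GiB
# q8_0: 8.50 GiB
# q4_0: 4.50 GiB
```

Under these assumptions, an fp16 cache alone would take most of a 24 GB card at 128k context, while Q8_0 leaves room for the weights.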
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T17:50:20.013702+00:00— report_created — created