Report #13509
[tooling] llama.cpp inference slower than expected on modern GPUs despite full GPU offloading
Add the --flash-attn flag (requires CUDA 11.8+ or ROCm 5.5+) to reduce memory bandwidth usage by 30-40% at long contexts; verify the CUDA toolkit version with nvcc --version first.
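The version check and the flag can be combined in a small launcher. This is a minimal sketch, not part of llama.cpp: the binary name ./llama-cli, the model path model.gguf, and the helper names are hypothetical placeholders, and the 11.8 threshold is the one stated in this report. Some newer llama.cpp builds may expect a value after --flash-attn, so check the binary's --help output before relying on the bare flag.

```python
import re
import shutil
import subprocess

# Minimum CUDA release taken from this report; not independently verified.
MIN_CUDA = (11, 8)


def parse_cuda_release(nvcc_output):
    """Return (major, minor) parsed from `nvcc --version` text, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


def build_command(binary, model_path, cuda_release):
    """Build an argument list, adding --flash-attn only if CUDA is new enough."""
    # -m selects the model file; -ngl 99 requests offloading all layers to the GPU.
    cmd = [binary, "-m", model_path, "-ngl", "99"]
    if cuda_release is not None and cuda_release >= MIN_CUDA:
        cmd.append("--flash-attn")
    return cmd


if __name__ == "__main__":
    release = None
    if shutil.which("nvcc") is not None:
        result = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True
        )
        release = parse_cuda_release(result.stdout)
    # Hypothetical binary and model names; substitute your own paths.
    print(build_command("./llama-cli", "model.gguf", release))
```

The sketch only prints the command rather than running it, so the result can be inspected first. It covers the CUDA path only; a ROCm check would need a different version probe.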
Journey Context:
Users assume FlashAttention is automatic because Python frameworks default to it, but llama.cpp makes it opt-in due to kernel compilation dependencies. Without it, KV cache bandwidth bottlenecks inference at 4k+ contexts even on fast GPUs like RTX 4090s, yet few tutorials mention the flag because it errors on older CUDA versions.
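To get a feel for why context length matters here, the size of the KV cache can be estimated from the model shape. The sketch below is a back-of-the-envelope calculation, assuming a hypothetical model with 32 layers, 32 KV heads, a head dimension of 128, and 16-bit cache entries; these numbers are illustrative and are not taken from this report. It shows how much cached data grows with context, not the effect of the flag itself.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a given context length."""
    # Factor 2 accounts for storing both keys and values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx


# Hypothetical model shape (assumption, not from the report).
for n_ctx in (512, 4096, 16384):
    size = kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=32, head_dim=128)
    print(f"{n_ctx:>6} tokens: {size / 2**30:.2f} GiB")
```

Under these assumptions the cache is about 0.5 MiB per token, so it reaches 2 GiB at a 4096-token context. Each generated token attends over the cached keys and values, which is why the cost grows with context length.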
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T18:52:41.703262+00:00— report_created — created