Report #40495
[tooling] Long-context inference (64k+) on llama.cpp is CPU-bound and slower than expected despite GPU acceleration
Build with LLAMA_CUDA_FA_ALL_QUANTS=1 and run with --flash-attn. Flash Attention avoids materializing the full N×N attention matrix, which cuts the extra attention memory from O(N²) to O(N) and greatly reduces GPU memory traffic. Combined with a quantized KV cache, this can make 128k context usable on 24GB GPUs.
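As a rough sanity check on why the full matrix cannot simply be materialized, the sketch below computes the size of one N×N score matrix. It assumes 2-byte (fp16) scores and a single attention head; the function name is illustrative and not part of llama.cpp.

```python
# Back-of-envelope size of a fully materialized attention score matrix.
# Assumptions (not from the report): fp16 scores, one attention head.

GIB = 1024 ** 3


def score_matrix_bytes(n_tokens: int, bytes_per_score: int = 2) -> int:
    """Bytes needed to hold one N x N score matrix for one head."""
    return n_tokens * n_tokens * bytes_per_score


if __name__ == "__main__":
    for n_tokens in (8 * 1024, 64 * 1024, 128 * 1024):
        size_gib = score_matrix_bytes(n_tokens) / GIB
        print(f"{n_tokens:>7} tokens: {size_gib:8.3f} GiB per head")
```

At 128k tokens a single head's matrix already exceeds 24GB, which is why a kernel that never stores it is needed at this context length.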
Journey Context:
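Before Flash Attention, the standard attention path in llama.cpp materializes the Q×K^T score matrix (N×N) in GPU memory, so both its footprint and the memory traffic it generates grow as O(N²). At 128k context with 2-byte scores, this amounts to 131072² × 2 bytes = 32 GiB of score data per attention head; the model dimension (8192 here) does not enter this figure. That is more than a 24GB card can hold, and moving it makes attention bound by memory bandwidth rather than compute. Flash Attention (implemented as CUDA kernels in llama.cpp) tiles the computation into blocks that fit in on-chip SRAM and never writes the N×N matrix out to GPU memory. This cuts the extra attention memory to O(N) and sharply reduces memory traffic, although the arithmetic itself remains O(N²). Building with LLAMA_CUDA_FA_ALL_QUANTS=1 (named GGML_CUDA_FA_ALL_QUANTS in newer source trees) compiles the Flash Attention kernels for all KV cache quantization type combinations; without it, only a subset of quantized KV cache types works with Flash Attention. Without Flash Attention and a quantized KV cache, 128k-class contexts are generally impractical on consumer hardware.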
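The tiling idea can be illustrated with a small pure-Python sketch for a single query row. This is not llama.cpp's CUDA kernel; it is a minimal model of the online-softmax technique that Flash Attention relies on, and all function and variable names are hypothetical. The tiled version holds only `block` scores at a time, yet should match the reference that computes every score up front.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materializes all N scores for one query row."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def tiled_attention(q, keys, values, block=4):
    """Same result, but only `block` scores exist at once (online softmax)."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Small deterministic inputs: 10 keys of width 3, 10 values of width 2.
keys = [[math.sin(i + j) for j in range(3)] for i in range(10)]
values = [[math.cos(i * j) for j in range(2)] for i in range(10)]
q = [0.5, -1.0, 2.0]

if __name__ == "__main__":
    ref = naive_attention(q, keys, values)
    out = tiled_attention(q, keys, values, block=4)
    print("max abs difference:", max(abs(a - b) for a, b in zip(ref, out)))
```

The running maximum and running sum are what let each block be processed and discarded, so working memory scales with the block size instead of with N.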
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:26:37.270494+00:00— report_created — created