Report #25375
[tooling] Slow prompt processing (prefill) and high memory spikes in llama.cpp on long contexts
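Add `-fa` or `--flash-attn` to enable llama.cpp's FlashAttention kernels. Newer builds take a value (for example `-fa on`) and may already select it automatically, so check `--help` for your build. The kernels compute attention in SRAM-sized tiles without materializing the full N×N attention matrix, which removes the O(N²) score-matrix reads and writes to HBM and can speed up prefill by roughly 2-3× on long contexts (>4k). This requires a build whose backend ships flash-attention kernels (for example CUDA, Metal, or ROCm). Interaction with a quantized KV cache depends on the build: some K/V type combinations are unsupported, while a quantized V cache generally requires flash attention to be enabled.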
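To get a feel for the allocation that tiling avoids, the arithmetic can be sketched as below. This is a rough estimate, not a measurement: the head count (32), fp32 scores, and the 512-row micro-batch are assumed values chosen for illustration, and `score_matrix_bytes` is a hypothetical helper, not a llama.cpp function.

```python
def score_matrix_bytes(n_query, n_kv, n_heads, bytes_per_score=4):
    """Bytes needed to hold one layer's Q x K^T scores for all heads."""
    return n_query * n_kv * n_heads * bytes_per_score


def to_gib(n_bytes):
    return n_bytes / 2**30


N_HEADS = 32        # assumed; depends on the model
MICRO_BATCH = 512   # assumed number of query rows processed per step

for n_ctx in (4096, 32768):
    # Whole prompt scored at once: N x N per head.
    full = score_matrix_bytes(n_ctx, n_ctx, N_HEADS)
    # Prompt scored one micro-batch of query rows at a time.
    chunked = score_matrix_bytes(MICRO_BATCH, n_ctx, N_HEADS)
    print(f"n_ctx={n_ctx}: full={to_gib(full):.2f} GiB, "
          f"per micro-batch={to_gib(chunked):.2f} GiB")
```

Even when the prompt is processed in micro-batches, the score buffer grows linearly with context length, which is consistent with the allocation spikes described here. A tiled kernel only ever holds one tile of scores in on-chip memory.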
Journey Context:
llama.cpp's default attention implementation materializes the Q×K^T score matrix (N×N per head for a full prompt) in VRAM and reads and writes O(N²) elements to HBM for the softmax and the multiplication by V. On long contexts (32k+), this saturates HBM bandwidth and causes large allocation spikes. FlashAttention uses tiling and online softmax statistics to compute the output incrementally in on-chip SRAM, so the N×N matrix is never written out. The extra memory drops from O(N²) to O(N) and HBM traffic falls substantially, although the arithmetic itself remains O(N²). Critical limitation: the FlashAttention kernels support only specific head dimensions, and their interaction with a quantized KV cache (Q4/Q8) varies by build and backend. Some K/V type combinations have no flash-attention kernel, while a quantized V cache generally requires flash attention to be enabled. Check which combination your build accepts, then weigh prefill speed against cache compression according to your bottleneck (compute vs VRAM).
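The tiling and online softmax idea can be illustrated with a minimal pure-Python sketch for a single query row. It is not llama.cpp's kernel: there is no causal mask, no batching, and no GPU memory hierarchy, and the function names and tile size are made up for the example. It only shows that a running maximum, a running denominator, and an unnormalized accumulator are enough to reproduce the result of the full softmax.

```python
import math
import random


def naive_attention(q, keys, values):
    """Reference: compute every score for the query row, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def tiled_attention(q, keys, values, tile=4):
    """Online softmax: visit K/V one tile at a time, keeping only a running
    max, a running denominator, and an unnormalized output accumulator."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            denom += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / denom for a in acc]


random.seed(0)
n_kv, head_dim = 10, 8
q = [random.gauss(0, 1) for _ in range(head_dim)]
keys = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(n_kv)]
values = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(n_kv)]

full = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, tile=4)
print(max(abs(a - b) for a, b in zip(full, tiled)))  # difference is at rounding level
```

The tiled version never holds more than `tile` scores at once, which is the property that lets a real kernel keep its working set in on-chip memory.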
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:59:45.821260+00:00— report_created — created