Report #15954
[tooling] Slow prompt processing (prefill) speed on modern GPUs despite high memory bandwidth
Enable --flash-attn to use FlashAttention-2 kernels, which reduce memory-bandwidth pressure during attention computation by never materializing the full N×N attention matrix in HBM.
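To get a feel for why the N×N matrix matters, here is a rough back-of-the-envelope sketch. The head count (32) and the 2-byte fp16 element size are assumptions chosen for illustration, not values taken from this report, and the helper name is hypothetical.

```python
def attention_matrix_bytes(seq_len: int, n_heads: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold one layer's full N x N score matrix for all heads.

    Assumes every head's scores are materialized at once (illustrative only).
    """
    return seq_len * seq_len * n_heads * bytes_per_elem


# Hypothetical model shape: 32 attention heads, fp16 scores.
for n in (2_048, 8_192, 32_768):
    gib = attention_matrix_bytes(n, n_heads=32) / 2**30
    print(f"N={n:>6}: {gib:8.2f} GiB per layer")
```

Under these assumptions the score matrix grows from 0.25 GiB per layer at 2,048 tokens to 64 GiB per layer at 32,768 tokens, which is the quadratic growth the fix avoids writing to HBM.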
Journey Context:
Standard attention materializes the full N×N (sequence length squared) attention matrix in high-bandwidth memory (HBM), so memory traffic saturates the available bandwidth during long-context prefills. FlashAttention combines tiling with an online softmax so that each tile's intermediate scores stay in on-chip SRAM, which substantially reduces HBM reads and writes. Critical distinction: this primarily accelerates the prefill phase (prompt processing), not the decode phase, because decode is typically bound by loading model weights from memory rather than by attention computation. Common error: enabling this on short contexts (under about 2k tokens), where the overhead can exceed the gains, or on CPUs, where the kernel may be unavailable.
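The online-softmax idea can be sketched in a few lines of plain Python. This is a minimal single-query illustration of the technique, not the GPU kernel that --flash-attn enables; the function names, the tile size, and the sample data are all hypothetical. The tiled version visits keys and values one tile at a time and keeps only a running maximum, a running normalizer, and an unnormalized output, yet matches the version that materializes every score.

```python
import math


def naive_attention(q, keys, values):
    """Reference: compute all scores first, then softmax, then weight the values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[d] for e, v in zip(exps, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    """Online softmax: process keys/values tile by tile with O(tile) scratch space."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    run_max = -math.inf   # largest score seen so far
    run_sum = 0.0         # softmax denominator, relative to run_max
    acc = [0.0] * dim     # unnormalized output, relative to run_max
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(run_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        correction = math.exp(run_max - new_max)
        run_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            run_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        run_max = new_max
    return [a / run_sum for a in acc]


# Illustrative data only.
q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 0.5], [0.2, -0.3, 1.5], [-1.0, 2.0, 0.0],
        [0.7, 0.7, 0.7], [3.0, -1.0, 0.2]]
values = [[1.0, 2.0], [0.0, -1.0], [4.0, 0.5], [-2.0, 3.0], [0.5, 0.5]]

reference = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, tile=2)
print(all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
          for a, b in zip(reference, tiled)))
```

The rescaling step is what lets the result stay exact without ever holding the full row of scores; a real kernel applies the same bookkeeping to blocks of queries held in SRAM.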
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T01:25:28.401007+00:00— report_created — created