Report #35864
[tooling] Slow prompt processing (prefill) speed on consumer GPUs despite using KV-cache quantization
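Enable FlashAttention-2 with the `-fa` flag when prompt processing (prefill) is the slow phase, as on consumer cards like the RTX 4090, where unfused attention is limited by memory bandwidth. For batch=1 decoding, benchmark with and without the flag and disable it if it does not help: each step attends with a single query token, so fusion saves little memory traffic and the kernel launch overhead can outweigh the gain.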
Journey Context:
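FlashAttention-2 fuses the attention operations into one kernel, so the full score matrix is never written to off-chip GPU memory (HBM on datacenter cards, GDDR on consumer cards). During prefill on consumer GPUs, where unfused attention is memory-bandwidth-bound, this can give a 2-3x speedup. The benefit shrinks in batch=1 decoding, where each step has a single query token and most of the time goes to reading weights and the KV cache rather than attention scores, and on H100s, where much higher memory bandwidth leaves attention closer to compute-bound. In those cases the extra kernel overhead can hurt performance. Many users enable the flag blindly without checking which side of the bandwidth/compute tradeoff they are on.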
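The bandwidth argument above can be made concrete with a back-of-the-envelope traffic count. The sketch below is a simplified model, not a measurement: the function names, the head dimension of 128, the fp16 element size, and the assumption that the unfused path makes four passes over the score matrix are all illustrative choices, and the fused figure is optimistic because real kernels re-read K and V once per query tile.

```python
def attention_traffic_bytes(n_q, n_kv, fused, head_dim=128, bytes_per_elem=2):
    """Rough off-chip bytes moved by one attention head (assumed model).

    Both paths read Q, K, V once and write the output O once.
    The unfused path also moves the n_q x n_kv score matrix four times:
    write scores, read for softmax, write probabilities, read for P @ V.
    The fused path is assumed to keep scores in on-chip memory.
    """
    qkvo = (2 * n_q + 2 * n_kv) * head_dim * bytes_per_elem
    if fused:
        return qkvo
    scores = 4 * n_q * n_kv * bytes_per_elem
    return qkvo + scores


def traffic_ratio(n_q, n_kv):
    """Unfused traffic divided by fused traffic for the same shapes."""
    unfused = attention_traffic_bytes(n_q, n_kv, fused=False)
    fused = attention_traffic_bytes(n_q, n_kv, fused=True)
    return unfused / fused


# Prefill: every prompt token is a query (hypothetical 4096-token prompt).
prefill = traffic_ratio(n_q=4096, n_kv=4096)
# Batch=1 decode: one new query token against a 4096-token KV cache.
decode = traffic_ratio(n_q=1, n_kv=4096)

print(f"prefill: unfused moves {prefill:.1f}x the bytes of fused")
print(f"decode:  unfused moves {decode:.3f}x the bytes of fused")
```

Under these assumptions the unfused prefill case moves about 33x the bytes of the fused one, while the batch=1 decode case differs by under 2%. That is a ratio of memory traffic, not a predicted speedup, but it shows why fusion has much more to offer in prefill than in single-token decoding.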
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:40:13.500565+00:00— report_created — created