Report #63831
[tooling] Slow prompt processing and high memory bandwidth usage on modern GPUs
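Add `-fa` or `--flash-attn` to enable the Flash Attention kernels. They avoid materializing the full attention matrix, cutting its memory footprint from O(n²) to O(n) in context length, and significantly speed up prompt processing on CUDA and Metal. Some newer builds expect an explicit value (for example `-fa on`), so check `--help` for your build.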
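For a rough sense of the O(n²) versus O(n) gap, the sketch below compares the memory needed to materialize every head's full score matrix with the per-row statistics a tiled kernel keeps instead. The head count, the fp16 score size, and the two fp32 statistics per row are illustrative assumptions, not values read from llama.cpp.

```python
def full_scores_bytes(n_ctx: int, n_heads: int = 32, bytes_per_el: int = 2) -> int:
    """Bytes to materialize an n_ctx x n_ctx score matrix for every head.

    n_heads=32 and fp16 scores (2 bytes) are assumed for illustration.
    """
    return n_heads * n_ctx * n_ctx * bytes_per_el


def row_stats_bytes(n_ctx: int, n_heads: int = 32, bytes_per_stat: int = 4) -> int:
    """Bytes for the running max and running sum kept per query row.

    Two fp32 statistics per row are assumed; this grows linearly with n_ctx.
    """
    return n_heads * n_ctx * 2 * bytes_per_stat


if __name__ == "__main__":
    for n_ctx in (2048, 8192, 32768):
        full_gib = full_scores_bytes(n_ctx) / 2**30
        stats_mib = row_stats_bytes(n_ctx) / 2**20
        print(f"n_ctx={n_ctx}: full scores {full_gib:.2f} GiB, row stats {stats_mib:.2f} MiB")
```

Under these assumptions an 8192-token prompt would need 4 GiB for the materialized scores of one layer, against 2 MiB of row statistics, which is why the gap matters most at long contexts.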
Journey Context:
Standard attention materializes the full N×N attention matrix, so it becomes memory-bandwidth bound at long contexts. Flash Attention tiles the computation so that the full matrix is never materialized, which reduces HBM accesses. In llama.cpp this is not enabled by default because it requires specific GPU capabilities (CUDA compute capability 7.5+ or Metal). Users often miss the flag even though it provides 2-3x speedups in prompt processing for long contexts, which makes it essential for high-throughput local LLM serving.
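The tiling idea can be illustrated with a minimal pure-Python sketch for a single query vector. It is not llama.cpp's kernel: the function names, the tile size, and the sample data are hypothetical, and real implementations run this per block on the GPU. The point is that a running max, a running sum, and an output accumulator are enough to reproduce the full softmax without ever holding all scores at once.

```python
import math


def naive_attention(q, K, V):
    """Reference: compute every score first, then one full softmax."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))]


def tiled_attention(q, K, V, tile=2):
    """Same result, visiting keys/values one tile at a time (online softmax)."""
    d = len(q)
    running_max = -math.inf          # largest score seen so far
    running_sum = 0.0                # softmax denominator, relative to running_max
    acc = [0.0] * len(V[0])          # unnormalized output, relative to running_max
    for start in range(0, len(K), tile):
        k_tile = K[start:start + tile]
        v_tile = V[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max (0.0 on the first tile).
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(weights)
        acc = [a * scale + sum(w * v[j] for w, v in zip(weights, v_tile))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    # Hypothetical sample data: 5 keys so the last tile is partial.
    q = [1.0, 0.5]
    K = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
    V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
    print(naive_attention(q, K, V))
    print(tiled_attention(q, K, V, tile=2))
```

Both functions should agree up to floating-point rounding; the tiled version only ever holds `tile` scores at a time, which is the property that keeps the working set in fast on-chip memory.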
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:37:35.563678+00:00— report_created — created