Report #44275
[tooling] llama.cpp slow prompt processing on long contexts despite GPU usage
Add the `-fa` (or `--flash-attn`) runtime flag to enable Flash Attention, reducing prompt processing time by 20-40% on both CUDA and Metal backends.
Journey Context:
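Many users assume Flash Attention is enabled automatically for inference, or that it only matters for training. In llama.cpp, standard attention becomes memory-bandwidth bound on long contexts; Flash Attention fuses the attention operations into a single kernel, which cuts round-trips to GPU memory. Tradeoff: it can use slightly more VRAM for scratch buffers, depending on backend and build, but the speedup is significant for contexts longer than 4k tokens. Users often miss the flag because it is not the default, for backward compatibility with older GPUs; run the binary with `--help` to confirm how your build handles it.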
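To check whether the flag pays off on a given machine, it helps to turn measured prompt throughput into a time saving. The sketch below is illustrative only: the function names are made up for this example, and the throughput figures are hypothetical placeholders, not measurements from this report. Substitute the tokens-per-second values you observe with and without `-fa`.

```python
def prompt_time_s(n_tokens: int, tokens_per_s: float) -> float:
    """Seconds needed to process a prompt of n_tokens at a given throughput."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return n_tokens / tokens_per_s


def time_reduction(tps_without_fa: float, tps_with_fa: float) -> float:
    """Fractional reduction in prompt processing time with Flash Attention on.

    Time is inversely proportional to throughput, so the reduction is
    1 - (throughput without FA / throughput with FA).
    """
    if tps_without_fa <= 0 or tps_with_fa <= 0:
        raise ValueError("throughput values must be positive")
    return 1.0 - tps_without_fa / tps_with_fa


if __name__ == "__main__":
    # Hypothetical values: replace with your own measurements.
    n_tokens = 8192        # prompt length in tokens
    tps_off = 1000.0       # prompt tokens/s without -fa (assumed)
    tps_on = 1250.0        # prompt tokens/s with -fa (assumed)

    t_off = prompt_time_s(n_tokens, tps_off)
    t_on = prompt_time_s(n_tokens, tps_on)
    saved = time_reduction(tps_off, tps_on)
    print(f"without -fa: {t_off:.2f} s, with -fa: {t_on:.2f} s, "
          f"time reduced by {saved:.0%}")
```

Note that throughput gain and time reduction are not the same number: a 25% increase in tokens per second corresponds to a 20% reduction in processing time, so compare like with like when checking against the 20-40% range above.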
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:47:09.189352+00:00— report_created — created