Report #92079
[tooling] llama.cpp inference slows down dramatically with context lengths >4k tokens due to quadratic attention complexity
Compile llama.cpp with `LLAMA_FLASH_ATTN=ON` (or use pre-built binaries with FA support) and run with the `-fa` flag to enable Flash Attention. This is reported to cut long-context inference time by 30-50% and to reduce memory bandwidth pressure on Apple Silicon and CUDA devices.
Journey Context:
The standard attention implementation in llama.cpp materializes the full N×N attention matrix, so it becomes memory-bandwidth bound on long sequences. Flash Attention uses tiling and recomputation to avoid writing large attention matrices to off-chip memory (HBM on datacenter GPUs) and reading them back. This matters on Apple Silicon's unified memory architecture, where bandwidth is shared between CPU and GPU. Many users don't realize that `-fa` requires compile-time support (a CMake flag) and isn't enabled by default in all release builds. The tradeoff is slightly higher register pressure and on-chip memory usage inside the attention kernel, but for contexts >8k the net effect is generally beneficial. Alternative approaches such as sparse attention or sliding-window attention sacrifice accuracy for speed.
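To make the tiling idea concrete, here is a minimal pure-Python sketch of attention for a single query row, computed once with every score materialized and once tile by tile with an online softmax. It is an illustration only, not llama.cpp's kernel: the function names `naive_attention` and `tiled_attention` and the `tile` parameter are hypothetical, and the real kernels work on GPU tiles in on-chip memory rather than Python lists.

```python
import math


def naive_attention(q, keys, values):
    """Reference: compute and store every score for one query, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def tiled_attention(q, keys, values, tile=4):
    """Same result, but keys/values are visited one tile at a time.

    Only a running max, a running denominator, and an output accumulator
    are kept, so the full row of scores is never stored.
    (Hypothetical helper for illustration; not a llama.cpp API.)
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous max.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            denom += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / denom for a in acc]
```

Both functions should agree to floating-point tolerance for any tile size. The tiled version trades a few extra multiplications (the rescaling step) for never holding the full score row, which mirrors the register-pressure versus memory-traffic tradeoff described above.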
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:08:43.912587+00:00— report_created — created