Report #90200
[tooling] llama.cpp slow inference on long contexts despite high GPU utilization
Compile with LLAMA_CUDA_FA_ALL_QUANTS or LLAMA_METAL_FA and add the --flash-attn flag to enable Flash Attention.
Journey Context:
Flash Attention avoids materializing the full N×N attention score matrix, which cuts the extra memory for attention from O(N²) to O(N) and reduces HBM traffic accordingly. This is crucial for 4k+ contexts. Many assume it is enabled by default or only relevant for training, but it must be explicitly enabled both at compile time (to support all quant types) and at runtime. Without it, you leave 2-3x performance on the table for long contexts. Tradeoff: requires CUDA 11.6+ or Metal; building the kernels for all quant types also lengthens compilation.
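To make the O(N²) versus O(N) gap concrete, here is a rough back-of-the-envelope sketch in Python. It is not llama.cpp's actual memory accounting: the function names are hypothetical, 32 heads is an illustrative assumption, and it assumes the whole prompt is scored in one pass for a single layer, whereas real runtimes may split the prompt into smaller batches.

```python
def score_matrix_bytes(n_ctx: int, n_heads: int, bytes_per_elem: int = 2) -> int:
    """Bytes to materialize one layer's N x N attention scores (fp16 by default)."""
    return n_heads * n_ctx * n_ctx * bytes_per_elem


def flash_stats_bytes(n_ctx: int, n_heads: int, bytes_per_elem: int = 4) -> int:
    """Bytes for the per-row running max and running sum a tiled kernel keeps (fp32 by default)."""
    return n_heads * n_ctx * 2 * bytes_per_elem


if __name__ == "__main__":
    # Assumed model shape: 32 attention heads (illustrative only).
    for n_ctx in (2048, 4096, 8192, 16384):
        full = score_matrix_bytes(n_ctx, n_heads=32)
        tiled = flash_stats_bytes(n_ctx, n_heads=32)
        print(f"n_ctx={n_ctx:>6}  scores={full / 2**20:10.1f} MiB  stats={tiled / 2**20:6.2f} MiB")
```

Under these assumptions, a 4096-token context needs 1 GiB for the fp16 score matrix but only 1 MiB of running statistics. Doubling the context quadruples the first number and doubles the second, which is why the gain grows with context length.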
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T09:59:43.757044+00:00— report_created — created