Report #52717
[tooling] llama.cpp slow on long context (4k+) on Apple Silicon due to memory bandwidth saturation
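Enable Flash Attention for Metal: add the -fa (long form --flash-attn) flag to the server/main command line (llama-server and llama-cli in newer builds). The exact form of the flag has changed between releases, so check --help for your build. Flash Attention avoids writing the full attention score matrix to memory, which reduces that part of the memory traffic from O(n²) to O(n) in the context length n. This is critical for long-context performance on unified-memory Macs.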
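To give a sense of scale for the O(n²) versus O(n) claim, here is a back-of-envelope sketch in Python. It is not taken from llama.cpp: the head count (32), the element sizes, and the assumption that a whole 8k prompt is processed in one pass are all illustrative choices, and both function names are hypothetical.

```python
def naive_score_bytes(n_ctx, n_heads, bytes_per_elem=2):
    """Bytes needed to materialise the full n_ctx x n_ctx score matrix per head.

    Assumes fp16 scores (2 bytes each); this is an illustrative choice.
    """
    return n_heads * n_ctx * n_ctx * bytes_per_elem


def fused_stat_bytes(n_ctx, n_heads, bytes_per_elem=4):
    """Bytes for the per-row running max and running sum kept by a fused kernel.

    Assumes two fp32 statistics per query row; the score tiles themselves
    stay in on-chip memory and are not counted here.
    """
    return n_heads * n_ctx * 2 * bytes_per_elem


N_CTX = 8192    # context length from the report's 8k example
N_HEADS = 32    # hypothetical head count, not from the report

naive = naive_score_bytes(N_CTX, N_HEADS)
fused = fused_stat_bytes(N_CTX, N_HEADS)

print(f"materialised scores: {naive / 1024**3:.1f} GiB per layer")
print(f"fused-kernel stats:  {fused / 1024**2:.1f} MiB per layer")
```

In practice llama.cpp processes prompts in smaller batches, so the real intermediate buffers are smaller than the first figure. The sketch only illustrates how the two quantities scale with context length.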
Journey Context:
Without Flash Attention, the standard attention path writes the intermediate attention scores to memory and reads them back, on top of reading the KV cache for each new token. On Apple Silicon this traffic saturates memory bandwidth, especially at 8k+ contexts. The -fa flag selects a fused Metal kernel implementing Flash Attention-2, which reduces DRAM traffic by keeping intermediate results in on-chip SRAM. The reported gain is a 2-3x speedup at 8k context on M2/M3 Ultra compared to the standard Metal backend; this figure comes from the report and has not been independently verified. Many Mac users don't enable this flag, either because Flash Attention was initially CUDA-only or because they assume Metal doesn't support it yet.
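The idea of keeping intermediate results on-chip can be sketched with a streaming softmax for a single query. This is a minimal pure-Python illustration of the general technique, not the Metal kernel used by llama.cpp; the function names, block size, and random test data are all assumptions made for the example.

```python
import math
import random


def naive_attention(q, keys, values):
    """Single-query attention that stores every score before normalising."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def streaming_attention(q, keys, values, block=4):
    """Same result, visiting K/V block by block with constant extra state."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        # Only one block of scores exists at any time.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


# Hypothetical test data: 37 cached tokens, head dimension 8.
rng = random.Random(0)
q = [rng.uniform(-1, 1) for _ in range(8)]
keys = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(37)]
values = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(37)]

out_naive = naive_attention(q, keys, values)
out_stream = streaming_attention(q, keys, values, block=4)
max_diff = max(abs(a - b) for a, b in zip(out_naive, out_stream))
print("outputs match:", max_diff < 1e-9)
```

The streaming version never holds more than one block of scores, which is what lets a fused GPU kernel keep them in on-chip memory instead of writing them out to DRAM.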
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:59:06.378715+00:00— report_created — created