Report #9345
[tooling] llama.cpp runs slower than expected on M1/M2/M3 Macs despite Metal GPU acceleration, especially with long contexts
Run llama.cpp with --flash-attn (short form -fa) to enable Flash Attention. This reduces the memory traffic of the attention step, which becomes a significant bottleneck at long context lengths on unified memory architectures.
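As a minimal sketch of the fix, the helper below assembles a command line with Flash Attention enabled. The function name build_llama_command, the binary name llama-cli, the placeholder model.gguf, and the default context size are assumptions for illustration, not taken from the report. Depending on the llama.cpp version, --flash-attn is either a bare switch or takes a value such as on, off, or auto, so check the output of --help for your build first.

```python
import shlex


def build_llama_command(model_path, context_size=8192,
                        flash_attn_value=None, binary="llama-cli"):
    """Return an argv list that enables Flash Attention (hypothetical helper).

    flash_attn_value: None for builds where --flash-attn is a bare switch,
    or a string such as "on" for builds where the flag takes a value.
    """
    cmd = [
        binary,
        "-m", model_path,          # path to the GGUF model (placeholder)
        "-ngl", "99",              # offload all layers to the Metal GPU
        "-c", str(context_size),   # context length; long contexts benefit most
        "--flash-attn",            # enable Flash Attention at runtime
    ]
    if flash_attn_value is not None:
        cmd.append(flash_attn_value)
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_llama_command("model.gguf")))
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without llama.cpp installed; pass the list to subprocess.run once the flags are confirmed for your build.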
Journey Context:
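Apple Silicon offers high unified memory bandwidth (roughly 400-800 GB/s on Max and Ultra chips) but less raw compute than high-end NVIDIA GPUs. Standard attention implementations are memory-bandwidth bound: they write the full attention score matrix to memory and read it back, on top of reading the KV cache for every token. Flash Attention is an IO-aware algorithm that fuses these operations and keeps intermediate results in fast on-chip memory, reducing traffic to main GPU memory (HBM on NVIDIA hardware, unified memory on Macs). On Macs this mainly helps at long context, where the attention step takes a growing share of the time per token. It moves that step toward being compute-bound, but token generation as a whole stays limited by how fast the model weights can be read. The report claims that an M2 Ultra cannot exceed 20 tok/sec on 70B models without Flash Attention and reaches 30-40 tok/sec with it. These figures are unverified: at 800 GB/s, a 4-bit 70B model (about 40 GB of weights) is capped near 20 tok/sec by weight reads alone, so expect gains in long-context throughput and memory use rather than a doubling of generation speed. Note: --flash-attn is a runtime flag. The report also names a GGML_METAL_FLASH_ATTN build option; confirm that it exists in your llama.cpp version before relying on it.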
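A rough way to sanity-check throughput figures like those above is a bandwidth ceiling: each generated token requires reading all model weights plus the KV cache once, so tokens per second cannot exceed bandwidth divided by bytes read per token. The sketch below assumes 4.5 bits per weight as an average for a 4-bit quantization and uses peak theoretical bandwidth; both are illustrative assumptions, the function names are hypothetical, and real throughput will be lower.

```python
def quantized_weight_gb(n_params_billion, bits_per_weight):
    """Approximate weight size in GB for a quantized model."""
    return n_params_billion * bits_per_weight / 8.0


def decode_ceiling_tok_per_sec(bandwidth_gb_s, weight_gb, kv_cache_gb=0.0):
    """Upper bound on decode speed when every token reads weights + KV cache."""
    gb_per_token = weight_gb + kv_cache_gb
    if gb_per_token <= 0:
        raise ValueError("bytes read per token must be positive")
    return bandwidth_gb_s / gb_per_token


if __name__ == "__main__":
    # Assumed values: 70B parameters, ~4.5 bits/weight, 800 GB/s peak bandwidth.
    weights = quantized_weight_gb(70, 4.5)  # 39.375 GB
    short_ctx = decode_ceiling_tok_per_sec(800, weights)
    long_ctx = decode_ceiling_tok_per_sec(800, weights, kv_cache_gb=5.0)
    print(f"weights: {weights:.1f} GB")
    print(f"ceiling, empty KV cache: {short_ctx:.1f} tok/s")
    print(f"ceiling, 5 GB KV cache:  {long_ctx:.1f} tok/s")
```

Under these assumptions the ceiling is about 20 tok/sec with an empty cache and drops as the KV cache grows. That growth in per-token memory traffic is why long contexts are where attention optimizations matter most.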
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T07:51:56.284319+00:00— report_created — created