Report #15751
[tooling] Mac Studio with 192GB unified memory is slow on 70B models despite having enough RAM
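Compile llama.cpp with GGML_METAL_FLASH_ATTN=ON (or set the environment variable GGML_METAL_FLASH_ATTN=1 in recent builds) to enable Flash Attention on Metal. This reduces memory-bandwidth pressure and is reported to give a 2-3x speedup for long-context inference on Apple Silicon. Check the flag name against your llama.cpp version before relying on it: upstream builds expose Flash Attention as the runtime option -fa / --flash-attn.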
Journey Context:
Users often assume the slowdown is an inherent memory-bandwidth limit of Apple Silicon. According to this report, the cause is that the standard Metal backend does not use Flash Attention, which leads to excessive memory traffic when attending over the KV cache. The Metal Flash Attention kernel (added in 2024) avoids writing out the full O(n²) attention-score matrix, so the attention step shifts from bandwidth-bound toward compute-bound. Without it, the report describes even a Mac Studio as crippled on a 70B model at 8k context; with it, performance is said to rival CUDA cards. The Metal kernel is a separate implementation from the CUDA Flash Attention kernel.
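To make the memory-traffic argument concrete, here is a rough back-of-the-envelope sketch in Python. It compares the size of the KV cache with the size of one layer's attention-score tensor if that tensor were written out to memory in full. The model shape (80 layers, 64 query heads, 8 KV heads, head dimension 128) is an assumption based on a Llama-2-70B-style architecture, and the batch size of 512 is likewise illustrative; neither is stated in the report. The function and class names are hypothetical.

```python
# Illustrative arithmetic only. The shape below is an ASSUMPTION
# (Llama-2-70B-style); the report does not say which 70B model was used.
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    n_layers: int = 80     # transformer blocks (assumed)
    n_heads: int = 64      # query heads (assumed)
    n_kv_heads: int = 8    # key/value heads under grouped-query attention (assumed)
    head_dim: int = 128    # per-head dimension (assumed)


def kv_cache_bytes(shape: ModelShape, n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Total size of the K and V caches across all layers (f16 by default)."""
    return 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * n_ctx * bytes_per_elem


def score_matrix_bytes(shape: ModelShape, n_batch: int, n_ctx: int,
                       bytes_per_elem: int = 4) -> int:
    """Size of ONE layer's attention scores if fully materialized (f32 by default).

    Shape is n_heads x n_batch x n_ctx: each query token in the batch
    scores against every cached position.
    """
    return shape.n_heads * n_batch * n_ctx * bytes_per_elem


if __name__ == "__main__":
    shape = ModelShape()
    n_ctx = 8192
    kv = kv_cache_bytes(shape, n_ctx)
    batch_scores = score_matrix_bytes(shape, n_batch=512, n_ctx=n_ctx)
    full_scores = score_matrix_bytes(shape, n_batch=n_ctx, n_ctx=n_ctx)
    print(f"KV cache at {n_ctx} tokens:         {kv / GIB:.2f} GiB")
    print(f"scores, one layer, 512-token batch: {batch_scores / GIB:.2f} GiB")
    print(f"scores, one layer, full n x n:      {full_scores / GIB:.2f} GiB")
```

Under these assumptions, the score tensor for a single layer is on the same order as the whole KV cache, and it would be written and re-read once per layer. A fused Flash Attention kernel keeps those scores in on-chip memory in tiles instead, which is the traffic reduction the report is describing. Real backends may already tile or batch differently, so treat the numbers as an order-of-magnitude illustration rather than a measurement.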
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T00:53:30.926077+00:00— report_created — created